<a href="https://colab.research.google.com/github/naveenkvarma/AI_ML_Internship/blob/main/Week_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Summary**

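This notebook trains and compares several classifiers on a 100-row dataset loaded from a CSV file. The target is the 'Best in Tech for both based on the sales of the stock' column.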


1. **Data Loading and Preprocessing:**
    - Loads the dataset into a pandas DataFrame.
    - Performs one-hot encoding on categorical features ('Value for money', 'Bestseller in which Country based on the sales of their stock').
    - Scales numerical features ('Price', 'Total Profits', etc.) using StandardScaler.

2. **Model Training and Evaluation:**
    - Splits the dataset into training and testing sets (80% train, 20% test).
    - Trains multiple classification models (Logistic Regression, Decision Tree, Random Forest, SVM, Gaussian Naive Bayes, K-Nearest Neighbors) on the training set.
    - Prints accuracy, classification report, and confusion matrix for each model on the test set.

3. **Hyperparameter Tuning:**
    - Performs grid search cross-validation to find optimal hyperparameters for a Random Forest model (tuning 'n_estimators' and 'max_depth').
    - Prints the best hyperparameters and corresponding score.

4. **Final Model Training:**
    Trains a final Random Forest model using the best hyperparameters found in the previous step.



In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset
df = pd.read_csv('/content/smp dataset.csv')

In [None]:
df.head(10)

Unnamed: 0,Price,Value for money,Total Profits,Total number of people in use of their services,Total Sales since launch,Bestseller in which Country based on the sales of their stock,Best in Tech for both based on the sales of the stock,Stock Value,Total Number of average stocks Bought In total as of today
0,150.25,medium,250000000.0,12000000,6000000,USA,Apple,160.5,5000000
1,120.0,high,30000000.0,8000000,4000000,China,Samsung,135.0,6000000
2,125.5,medium,15000000000.0,200000000,3000000000,India,Apple,130.0,50000000
3,78.9,high,1000000000.0,15000000,25000000,Brazil,Samsung,80.0,40000000
4,95.2,low,1200000000.0,18000000,3000000,China,Apple,100.0,60000000
5,87.6,medium,1400000000.0,22000000,4000000,USA,Samsung,90.0,70000000
6,112.3,high,1800000000.0,25000000,5000000,Germany,Apple,120.0,80000000
7,65.4,low,80000000.0,12000000,2000000,Japan,Samsung,70.0,30000000
8,101.7,medium,1600000000.0,23000000,4500000,Canada,Apple,110.0,90000000
9,72.8,high,90000000.0,14000000,2500000,Australia,Samsung,75.0,25000000


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 9 columns):
 #   Column                                                         Non-Null Count  Dtype  
---  ------                                                         --------------  -----  
 0   Price                                                          100 non-null    float64
 1   Value for money                                                100 non-null    object 
 2   Total Profits                                                  100 non-null    float64
 3   Total number of people in use of their services                100 non-null    int64  
 4   Total Sales since launch                                       100 non-null    int64  
 5   Bestseller in which Country based on the sales of their stock  100 non-null    object 
 6   Best in Tech for both based on the sales of the stock          100 non-null    object 
 7   Stock Value                                                    1

In [None]:
df.describe()

Unnamed: 0,Price,Total Profits,Total number of people in use of their services,Total Sales since launch,Stock Value,Total Number of average stocks Bought In total as of today
count,100.0,100.0,100.0,100.0,100.0,100.0
mean,124.7534,1234141000.0,39243000.0,303621800.0,152.5259,11464810.0
std,34.047861,1849793000.0,52978170.0,629158900.0,47.360356,23371470.0
min,50.0,234567.8,123456.0,98765.0,70.0,21098.0
25%,101.275,68750000.0,4225000.0,5000000.0,120.0,1425000.0
50%,122.075,500000000.0,8000000.0,14250000.0,145.0,2000000.0
75%,150.0,2125000000.0,81250000.0,405000000.0,182.5,5250000.0
max,210.45,15000000000.0,200000000.0,4000000000.0,300.0,100000000.0


In [None]:
df = pd.get_dummies(df, columns=['Value for money', 'Bestseller in which Country based on the sales of their stock'], prefix=['Value_for_money', 'Bestseller_country'])
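
Before modelling, it may be worth checking the values in the target column. One of the classification reports further down lists a 'South Korea' class with zero test support, which suggests at least one row carries a label other than Apple or Samsung. The sketch below uses a small hypothetical series rather than the real column, and the `expected` set is an assumption about which labels are valid.

```python
import pandas as pd

# Hypothetical label values standing in for the real target column
# ('Best in Tech for both based on the sales of the stock').
labels = pd.Series(
    ["Apple", "Samsung", "Apple", "South Korea", "Samsung", "Apple"],
    name="Best in Tech",
)

# Count how often each label occurs
counts = labels.value_counts()
print(counts)

# Assumed set of valid classes; anything else is flagged for review
expected = {"Apple", "Samsung"}
unexpected = sorted(set(labels.unique()) - expected)
print("Unexpected labels:", unexpected)

# One possible fix (an assumption, not the notebook's method):
# keep only the rows whose label is in the expected set.
clean = labels[labels.isin(expected)]
print(f"Kept {len(clean)} of {len(labels)} rows")
```

Whether such rows should be dropped or corrected depends on how the label ended up in the data, so inspecting them first is probably the safer route.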

In [None]:
# Scale the numerical features
numeric_cols = ['Price', 'Total Profits', 'Total number of people in use of their services',
                'Total Sales since launch', 'Stock Value',
                'Total Number of average stocks Bought In total as of today']
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

In [None]:
X = df.drop('Best in Tech for both based on the sales of the stock', axis=1)
y = df['Best in Tech for both based on the sales of the stock']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
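
In the cells above the scaler is fitted on the full DataFrame before the split, so the test rows contribute to the scaling statistics. A possible alternative is to wrap preprocessing and model in a scikit-learn `Pipeline`, so that the scaler and encoder are fitted on the training rows only. The sketch below is a minimal illustration under assumptions: it uses a toy frame built from the first ten rows shown by `df.head(10)`, a shortened `target` column name, and alternating toy labels.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (assumption): a few columns from the rows shown by df.head(10),
# with the target renamed to 'target' and filled with alternating toy labels.
toy = pd.DataFrame({
    "Price": [150.25, 120.0, 125.5, 78.9, 95.2, 87.6, 112.3, 65.4, 101.7, 72.8],
    "Stock Value": [160.5, 135.0, 130.0, 80.0, 100.0, 90.0, 120.0, 70.0, 110.0, 75.0],
    "Value for money": ["medium", "high", "medium", "high", "low",
                        "medium", "high", "low", "medium", "high"],
    "target": ["Apple", "Samsung"] * 5,
})

numeric_cols = ["Price", "Stock Value"]
categorical_cols = ["Value for money"]

# Scale numeric columns and one-hot encode categorical ones inside the pipeline
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
clf = Pipeline([("preprocess", preprocess), ("model", LogisticRegression())])

X_toy = toy.drop(columns="target")
y_toy = toy["target"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# fit() learns the scaling statistics and categories from the training rows only
clf.fit(X_tr, y_tr)
preds = clf.predict(X_te)
print(preds)
```

`handle_unknown="ignore"` keeps prediction from failing when a test row contains a category that never appeared in the training rows.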

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

models = [
    LogisticRegression(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    SVC(),
    GaussianNB(),
    KNeighborsClassifier()
]

for model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"Model: {model.__class__.__name__}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
    print(f"Classification Report:\n{classification_report(y_test, y_pred)}")
    print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")
    print()

Model: LogisticRegression
Accuracy: 0.300
Classification Report:
              precision    recall  f1-score   support

       Apple       0.30      0.30      0.30        10
     Samsung       0.30      0.30      0.30        10

    accuracy                           0.30        20
   macro avg       0.30      0.30      0.30        20
weighted avg       0.30      0.30      0.30        20

Confusion Matrix:
[[3 7]
 [7 3]]

Model: DecisionTreeClassifier
Accuracy: 0.650
Classification Report:
              precision    recall  f1-score   support

       Apple       0.64      0.70      0.67        10
     Samsung       0.67      0.60      0.63        10

    accuracy                           0.65        20
   macro avg       0.65      0.65      0.65        20
weighted avg       0.65      0.65      0.65        20

Confusion Matrix:
[[7 3]
 [4 6]]

Model: RandomForestClassifier
Accuracy: 0.800
Classification Report:
              precision    recall  f1-score   support

       Apple       0

  _warn_prf(average, modifier, msg_start, len(result))


Classification Report:
              precision    recall  f1-score   support

       Apple       0.50      0.80      0.62        10
     Samsung       1.00      0.10      0.18        10
 South Korea       0.00      0.00      0.00         0

    accuracy                           0.45        20
   macro avg       0.50      0.30      0.27        20
weighted avg       0.75      0.45      0.40        20

Confusion Matrix:
[[8 0 2]
 [8 1 1]
 [0 0 0]]

Model: KNeighborsClassifier
Accuracy: 0.650
Classification Report:
              precision    recall  f1-score   support

       Apple       0.62      0.80      0.70        10
     Samsung       0.71      0.50      0.59        10

    accuracy                           0.65        20
   macro avg       0.66      0.65      0.64        20
weighted avg       0.66      0.65      0.64        20

Confusion Matrix:
[[8 2]
 [5 5]]
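
As a worked check, the figures in the KNeighborsClassifier report can be recomputed by hand from its confusion matrix. scikit-learn lays the matrix out with true classes as rows and predicted classes as columns; the variable names below are illustrative.

```python
# Confusion matrix printed for KNeighborsClassifier.
# Rows are true classes, columns are predicted classes (Apple, Samsung).
cm = [[8, 2],   # true Apple:   8 predicted Apple, 2 predicted Samsung
      [5, 5]]   # true Samsung: 5 predicted Apple, 5 predicted Samsung

total = sum(sum(row) for row in cm)

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy = (cm[0][0] + cm[1][1]) / total

# Precision: of everything predicted as a class, how much was right (column-wise)
apple_precision = cm[0][0] / (cm[0][0] + cm[1][0])
samsung_precision = cm[1][1] / (cm[0][1] + cm[1][1])

# Recall: of everything truly in a class, how much was found (row-wise)
apple_recall = cm[0][0] / (cm[0][0] + cm[0][1])
samsung_recall = cm[1][1] / (cm[1][0] + cm[1][1])

print(f"accuracy          = {accuracy:.2f}")           # 0.65
print(f"Apple precision   = {apple_precision:.2f}")    # 0.62
print(f"Apple recall      = {apple_recall:.2f}")       # 0.80
print(f"Samsung precision = {samsung_precision:.2f}")  # 0.71
print(f"Samsung recall    = {samsung_recall:.2f}")     # 0.50
```

These values match the report: the model finds most Apple rows but mislabels half of the Samsung rows as Apple.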



In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 5, 10]}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Score: {grid_search.best_score_:.3f}")



Best Parameters: {'max_depth': 10, 'n_estimators': 10}
Best Score: 0.650


In [None]:
# Reuse the hyperparameters selected by the grid search above
best_model = RandomForestClassifier(**grid_search.best_params_)
best_model.fit(X_train, y_train)
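
The final model is not scored on the held-out test set in this notebook. A minimal sketch of that step follows, assuming synthetic data from `make_classification` in place of the real `X` and `y`. It also relies on `GridSearchCV` refitting the best configuration on the whole training set by default (`refit=True`), so `best_estimator_` can be used directly without retraining.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption); the notebook would use its own X and y.
X_demo, y_demo = make_classification(n_samples=100, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Same grid as the notebook; random_state is fixed only to make the sketch repeatable
param_grid = {"n_estimators": [10, 50, 100], "max_depth": [None, 5, 10]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_tr, y_tr)

# best_estimator_ is already refitted on all of X_tr with the best parameters
final_model = search.best_estimator_
test_acc = accuracy_score(y_te, final_model.predict(X_te))

print(f"Best parameters: {search.best_params_}")
print(f"Cross-validated accuracy: {search.best_score_:.3f}")
print(f"Held-out test accuracy:   {test_acc:.3f}")
```

Comparing the cross-validated score with the held-out score gives a rough sense of how far the tuning result carries over to unseen rows.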

## **Challenges faced**

The key challenges in this machine learning task include:

1. **Data Preprocessing**: Managing missing data, dealing with the curse of dimensionality from one-hot encoding, and ensuring that feature scaling (e.g., using StandardScaler) is appropriate for the dataset's distribution.

2. **Model Training and Evaluation**: Selecting suitable models and handling imbalanced data to avoid misleading metrics. Overfitting is a concern, especially with complex models like Random Forests, making it crucial to use proper validation techniques.

3. **Hyperparameter Tuning**: Grid search cross-validation can be computationally expensive, and with a small dataset there’s a risk of overfitting to the validation folds, so the best cross-validated score may overstate test performance. Choosing the right parameters and their ranges is essential for effective tuning.

4. **Final Model Training**: Ensuring that the final model generalizes well to unseen data while balancing performance and interpretability can be challenging, particularly with less transparent models like Random Forests.
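
One way to address the validation concern raised above is to compare models with stratified k-fold cross-validation instead of a single 80/20 split, since a 20-row test set moves accuracy by five points per row. The sketch below is an illustration on synthetic data; the `candidates` dictionary and the data are assumptions, not the notebook's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 100-row dataset (assumption)
X_demo, y_demo = make_classification(n_samples=100, n_features=8, random_state=0)

# Stratified folds keep the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical subset of the models compared in the notebook
candidates = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}

cv_results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the standard deviation alongside the mean shows how much the score varies between folds, which a single split hides.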

## **Plan for next week**

1. **Visualizations:** Plot model performance metrics (accuracy, ROC curves) for all models.

2. **Success Prediction:** Predict and visualize success rates using Indian raw materials at a cheaper price.

3. **Global Efficiency:** Forecast and visualize global stock value trends.

4. **Accuracy:** Re-evaluate and compare model accuracy on the dataset.
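
For the ROC curves mentioned in the first item, the points of the curve and the area under it can be computed with `roc_curve` and `roc_auc_score`. The sketch below is a rough starting point on synthetic 0/1 labels. With string labels such as 'Apple' and 'Samsung', `roc_curve` would need its `pos_label` argument set, and the choice of positive class is left open here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data with 0/1 labels (assumption; not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC analysis needs a score per row, here the predicted probability of class 1
scores = model.predict_proba(X_te)[:, 1]

# fpr and tpr are the x and y coordinates of the ROC curve
fpr, tpr, thresholds = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)

print(f"ROC AUC: {auc:.3f}")
print(f"Curve has {len(fpr)} points")
```

The `fpr` and `tpr` arrays can then be passed to a plotting library, with one curve per model drawn on the same axes for comparison.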