## INSTRUCTIONS 

Every learner must submit their own homework solutions. You may discuss the homework with each other, but you may not copy someone else's solution.

The homework consists of two parts:
1.	Data from our life
2.	Classification

Follow the prompts in the attached Jupyter notebook. We are using the same data as in the previous homework assignments. Use the version you created called df2, where you already cleaned the data and dropped some of the variables but did not create dummy variables. Instead of creating dummy variables, recode the fuel_type column as suggested below.
Add markdown cells to your analysis for your solutions, comments, and answers. Add as many cells as you need, and comment your code where possible for readability.

**Note:** This homework has a bonus question, so the highest mark that can be earned is 105.
Submission: Submit both an .ipynb file and a PDF of your work.
Good luck!



# 1. Data from our lives:

### Describe a situation or problem from your job, everyday life, current events, etc., for which a classification would be appropriate.

## Your answer 

### Fake News Classification:
Detecting fake news has become paramount. Machine learning models, employing classification techniques, can distinguish between genuine and fabricated news articles. These models analyze multiple features, such as content, writing style, and source credibility, to predict the authenticity of news pieces. Accurate classification is crucial as misclassifications can lead to the spread of misinformation, impacting public perception and trust in media sources. This example underscores the importance of precise classification in preserving the credibility of news in today's online landscape.
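
As a rough illustration of the idea above, the following sketch trains a tiny text classifier. The headlines and labels are invented for this example and are far too few for a real model; TF-IDF word weights stand in for the "content" features, and writing style or source credibility are not modeled at all.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy headlines, invented for illustration only (1 = fake, 0 = genuine)
headlines = [
    "Miracle pill cures every disease overnight, doctors stunned",
    "Aliens secretly control the stock market, insider claims",
    "You won't believe this one trick to double your money",
    "City council approves budget for road repairs",
    "Central bank keeps interest rates unchanged",
    "Local school opens new science laboratory",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF turns each headline into word-weight features;
# logistic regression then learns a decision boundary on those features
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
text_clf.fit(headlines, labels)

# Estimated probability that an unseen headline belongs to the "fake" class
new_headline = ["Doctors stunned by miracle trick"]
fake_probability = text_clf.predict_proba(new_headline)[0, 1]
print(f"Estimated probability of 'fake': {fake_probability:.2f}")
```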

# 2. Preprocessing

In [46]:
from scipy import stats
from sklearn.linear_model import LinearRegression
from statsmodels.compat import lzip
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm

%matplotlib inline

In [47]:
#Read in data
df = pd.read_csv('auto_imports1.csv')

df.head()

Unnamed: 0,fuel_type,body,wheel_base,length,width,heights,curb_weight,engine_type,cylinders,engine_size,bore,stroke,comprassion,horse_power,peak_rpm,city_mpg,highway_mpg,price
0,gas,convertible,88.6,168.8,64.1,48.8,2548,dohc,four,130,3.47,2.68,9.0,111,5000,21,27,13495
1,gas,convertible,88.6,168.8,64.1,48.8,2548,dohc,four,130,3.47,2.68,9.0,111,5000,21,27,16500
2,gas,hatchback,94.5,171.2,65.5,52.4,2823,ohcv,six,152,2.68,3.47,9.0,154,5000,19,26,16500
3,gas,sedan,99.8,176.6,66.2,54.3,2337,ohc,four,109,3.19,3.4,10.0,102,5500,24,30,13950
4,gas,sedan,99.4,176.6,66.4,54.3,2824,ohc,five,136,3.19,3.4,8.0,115,5500,18,22,17450


In [48]:
##your code here

# Check the data types of all columns in the DataFrame
Auto_data_types = df.dtypes

print(Auto_data_types)

fuel_type       object
body            object
wheel_base     float64
length         float64
width          float64
heights        float64
curb_weight      int64
engine_type     object
cylinders       object
engine_size      int64
bore            object
stroke          object
comprassion    float64
horse_power     object
peak_rpm        object
city_mpg         int64
highway_mpg      int64
price            int64
dtype: object


In [49]:
## Your code here

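## Replace '?' with NaN throughout the dataset
## (np.nan is used instead of None because older pandas versions treat value=None as a fill method)
df = df.replace('?', np.nan)

## Convert the object columns to float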
df['bore'] = df['bore'].astype(float)
df['stroke'] = df['stroke'].astype(float)
df['horse_power'] = df['horse_power'].astype(float)
df['peak_rpm'] = df['peak_rpm'].astype(float)

In [50]:
# Checking for remaining '?'
question_marks_remaining = (df == '?').sum().sum()

if question_marks_remaining == 0:
    print("no remaining '?' values in the dataset.")
else:
    print(f"{question_marks_remaining} remaining '?' values in the dataset.")


no remaining '?' values in the dataset.
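
The replace-then-cast steps above can also be done in one pass. The sketch below is an alternative, not the notebook's method: it uses `pd.to_numeric` with `errors="coerce"`, which turns any unparseable entry (including `'?'`) into NaN. The small `demo` frame is a hypothetical stand-in for the real columns.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the '?'-coded object columns
demo = pd.DataFrame({
    "bore": ["3.47", "?", "3.19"],
    "horse_power": ["111", "154", "?"],
})

numeric_cols = ["bore", "horse_power"]

# errors="coerce" converts anything that is not a number (such as '?') to NaN
demo[numeric_cols] = demo[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(demo.dtypes)
print(demo.isna().sum())
```

One trade-off of this approach is that it silently coerces every malformed value, not only `'?'`, so it is worth checking the NaN counts afterwards.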


In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 201 entries, 0 to 200
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   fuel_type    201 non-null    object 
 1   body         201 non-null    object 
 2   wheel_base   201 non-null    float64
 3   length       201 non-null    float64
 4   width        201 non-null    float64
 5   heights      201 non-null    float64
 6   curb_weight  201 non-null    int64  
 7   engine_type  201 non-null    object 
 8   cylinders    201 non-null    object 
 9   engine_size  201 non-null    int64  
 10  bore         197 non-null    float64
 11  stroke       197 non-null    float64
 12  comprassion  201 non-null    float64
 13  horse_power  199 non-null    float64
 14  peak_rpm     199 non-null    float64
 15  city_mpg     201 non-null    int64  
 16  highway_mpg  201 non-null    int64  
 17  price        201 non-null    int64  
dtypes: float64(9), int64(5), object(4)
memory usage: 2

In [52]:
## Your code here

# Dropping the specified columns and creating a new DataFrame df2
df2 = df.drop(columns=["body", "engine_type", "cylinders"])

In [53]:
df2.head()

Unnamed: 0,fuel_type,wheel_base,length,width,heights,curb_weight,engine_size,bore,stroke,comprassion,horse_power,peak_rpm,city_mpg,highway_mpg,price
0,gas,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,13495
1,gas,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,16500
2,gas,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,9.0,154.0,5000.0,19,26,16500
3,gas,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,10.0,102.0,5500.0,24,30,13950
4,gas,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,8.0,115.0,5500.0,18,22,17450


In [54]:
## your code goes here

## Dropping rows with NaN values
df2.dropna(inplace=True)


In [55]:
df2.isnull().sum()

fuel_type      0
wheel_base     0
length         0
width          0
heights        0
curb_weight    0
engine_size    0
bore           0
stroke         0
comprassion    0
horse_power    0
peak_rpm       0
city_mpg       0
highway_mpg    0
price          0
dtype: int64

In [56]:
## Your code goes here

# Creating dummy variables for fuel_type within df2 and dropping the first level
# (left commented out on purpose: this homework recodes fuel_type instead)
#df2 = pd.get_dummies(df2, columns=['fuel_type'], drop_first=True)

In [57]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 195 entries, 0 to 200
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   fuel_type    195 non-null    object 
 1   wheel_base   195 non-null    float64
 2   length       195 non-null    float64
 3   width        195 non-null    float64
 4   heights      195 non-null    float64
 5   curb_weight  195 non-null    int64  
 6   engine_size  195 non-null    int64  
 7   bore         195 non-null    float64
 8   stroke       195 non-null    float64
 9   comprassion  195 non-null    float64
 10  horse_power  195 non-null    float64
 11  peak_rpm     195 non-null    float64
 12  city_mpg     195 non-null    int64  
 13  highway_mpg  195 non-null    int64  
 14  price        195 non-null    int64  
dtypes: float64(9), int64(5), object(1)
memory usage: 24.4+ KB


In [58]:
df2.head()

Unnamed: 0,fuel_type,wheel_base,length,width,heights,curb_weight,engine_size,bore,stroke,comprassion,horse_power,peak_rpm,city_mpg,highway_mpg,price
0,gas,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,13495
1,gas,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,16500
2,gas,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,9.0,154.0,5000.0,19,26,16500
3,gas,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,10.0,102.0,5500.0,24,30,13950
4,gas,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,8.0,115.0,5500.0,18,22,17450


In [59]:
df2['fuel_type'].value_counts()

fuel_type
gas       175
diesel     20
Name: count, dtype: int64

In class we covered multiple classification methods. In this part of the homework you will compare them.

**Use the dataset 'auto_imports1.csv' from our previous homework assignments. More specifically, use the version you created called df2, where you already cleaned the data and dropped some of the variables but DID NOT CREATE dummy variables. Follow the prompts to complete the homework.**

## 2.1 **Replace ['gas', 'diesel'] string values with [0, 1]**

In [60]:
#Your code
# Recode fuel_type: gas -> 0, diesel -> 1
fuel_dict = {'gas': 0, 'diesel': 1}
df2['fuel_type'] = df2['fuel_type'].map(fuel_dict)
df2.head()


Unnamed: 0,fuel_type,wheel_base,length,width,heights,curb_weight,engine_size,bore,stroke,comprassion,horse_power,peak_rpm,city_mpg,highway_mpg,price
0,0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,13495
1,0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,16500
2,0,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,9.0,154.0,5000.0,19,26,16500
3,0,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,10.0,102.0,5500.0,24,30,13950
4,0,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,8.0,115.0,5500.0,18,22,17450


## 2.2 : Define your X and y: your dependent variable is fuel_type, the rest of the variables are your independent variables

In [61]:
#your code
X = df2.drop('fuel_type', axis=1)  # Independent variables
y = df2['fuel_type']  # Dependent variable


## 2.3 Split your data into training and testing sets. Use test_size=0.3, random_state=746!

In [62]:
#your code
from sklearn.model_selection import train_test_split

# Splitting the data into training and testing sets
X_train_custom, X_test_custom, y_train_custom, y_test_custom = train_test_split(X, y, test_size=0.3, random_state=746)
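
With only 20 diesel cars among 195 rows, a purely random split can leave the test set with noticeably more or fewer diesel cars than the overall share. One optional variation, which the prompt does not ask for, is to pass `stratify=y` so both splits keep roughly the same class proportions. The sketch below uses hypothetical placeholder arrays with the same 175/20 balance rather than the real df2.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with the same 175/20 class balance as fuel_type
y_demo = np.array([0] * 175 + [1] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the class proportions similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=746, stratify=y_demo
)

print("Minority share (train):", round(y_tr.mean(), 3))
print("Minority share (test): ", round(y_te.mean(), 3))
```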


# 3. Classification

### 3.1 Use Logistic regression to classify your data. Print/report your confusion matrix, classification report and AUC

In [63]:
#your code
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

# Initialize and fit Logistic Regression model
custom_logistic_model = LogisticRegression(max_iter=1000)
custom_logistic_model.fit(X_train_custom, y_train_custom)

# Predict on the test set
custom_y_pred = custom_logistic_model.predict(X_test_custom)

# Confusion matrix
custom_conf_matrix = confusion_matrix(y_test_custom, custom_y_pred)
print("Custom Confusion Matrix:")
print(custom_conf_matrix)

# Classification report
print("\nCustom Classification Report:")
print(classification_report(y_test_custom, custom_y_pred))

# Calculate AUC
custom_auc = roc_auc_score(y_test_custom, custom_y_pred)
print(f"\nCustom AUC: {custom_auc}")



Custom Confusion Matrix:
[[50  0]
 [ 0  9]]

Custom Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00         9

    accuracy                           1.00        59
   macro avg       1.00      1.00      1.00        59
weighted avg       1.00      1.00      1.00        59


Custom AUC: 1.0
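
A note on the AUC computation: `roc_auc_score` is normally given continuous scores (for example the predicted probability of class 1) rather than hard 0/1 predictions. With hard labels the ROC curve has a single operating point, so the value reduces to the average of sensitivity and specificity. The sketch below contrasts the two on synthetic stand-in data generated with `make_classification`; it does not use the auto imports data, and all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the auto imports data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Hard 0/1 labels: the ROC curve collapses to a single threshold
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))

# Class-1 probabilities: the ROC curve sweeps over all thresholds
auc_from_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(f"AUC from hard labels:   {auc_from_labels:.3f}")
print(f"AUC from probabilities: {auc_from_scores:.3f}")
```

When a classifier separates the test set perfectly, both variants give 1.0, so the distinction matters mainly for models that make mistakes.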


### 3.2 Use Naive Bayes to classify your data. Print/report your confusion matrix, classification report and AUC

In [64]:
#your code
from sklearn.naive_bayes import GaussianNB

# Initialize and fit Gaussian Naive Bayes model
custom_nb_model = GaussianNB()
custom_nb_model.fit(X_train_custom, y_train_custom)

# Predict on the test set
custom_y_pred_nb = custom_nb_model.predict(X_test_custom)

# Confusion matrix
custom_conf_matrix_nb = confusion_matrix(y_test_custom, custom_y_pred_nb)
print("Custom Confusion Matrix (Naive Bayes):")
print(custom_conf_matrix_nb)

# Classification report
print("\nCustom Classification Report (Naive Bayes):")
print(classification_report(y_test_custom, custom_y_pred_nb))

# Calculate AUC for Naive Bayes
custom_auc_nb = roc_auc_score(y_test_custom, custom_y_pred_nb)
print(f"\nCustom AUC (Naive Bayes): {custom_auc_nb}")



Custom Confusion Matrix (Naive Bayes):
[[50  0]
 [ 0  9]]

Custom Classification Report (Naive Bayes):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00         9

    accuracy                           1.00        59
   macro avg       1.00      1.00      1.00        59
weighted avg       1.00      1.00      1.00        59


Custom AUC (Naive Bayes): 1.0


### 3.3 Use KNN to classify your data. First find the optimal k and then run your classification. Print/report your confusion matrix, classification report and AUC

In [65]:
#your code
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

# Define the range of k values to test
custom_k_values = list(range(1, 21))

# Create a parameter grid
custom_param_grid = {'n_neighbors': custom_k_values}

# Initialize KNN classifier
custom_knn = KNeighborsClassifier()

# Perform grid search to find the optimal k
custom_grid_search = GridSearchCV(custom_knn, custom_param_grid, cv=5, scoring='accuracy')
custom_grid_search.fit(X_train_custom, y_train_custom)

# Get the best k value
custom_best_k = custom_grid_search.best_params_['n_neighbors']
print(f"Custom Best k value: {custom_best_k}")

# Initialize KNN classifier with best k
custom_knn = KNeighborsClassifier(n_neighbors=custom_best_k)
custom_knn.fit(X_train_custom, y_train_custom)

# Predict on the test set
custom_y_pred_knn = custom_knn.predict(X_test_custom)

# Confusion matrix
custom_conf_matrix_knn = confusion_matrix(y_test_custom, custom_y_pred_knn)
print("\nCustom Confusion Matrix (KNN):")
print(custom_conf_matrix_knn)

# Classification report
print("\nCustom Classification Report (KNN):")
print(classification_report(y_test_custom, custom_y_pred_knn))

# Calculate AUC for KNN
custom_auc_knn = roc_auc_score(y_test_custom, custom_y_pred_knn)
print(f"\nCustom AUC (KNN): {custom_auc_knn}")




Custom Best k value: 2

Custom Confusion Matrix (KNN):
[[50  0]
 [ 8  1]]

Custom Classification Report (KNN):
              precision    recall  f1-score   support

           0       0.86      1.00      0.93        50
           1       1.00      0.11      0.20         9

    accuracy                           0.86        59
   macro avg       0.93      0.56      0.56        59
weighted avg       0.88      0.86      0.82        59


Custom AUC (KNN): 0.5555555555555556


### 3.4 Choose one: SVM or Random Forest to classify your data. Print/report your confusion matrix, classification report and AUC

In [66]:
#your code
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest Classifier
custom_random_forest = RandomForestClassifier(random_state=746)

# Fit the classifier to the training data
custom_random_forest.fit(X_train_custom, y_train_custom)

# Predict on the test set
custom_y_pred_rf = custom_random_forest.predict(X_test_custom)

# Confusion matrix
custom_conf_matrix_rf = confusion_matrix(y_test_custom, custom_y_pred_rf)
print("\nCustom Confusion Matrix (Random Forest):")
print(custom_conf_matrix_rf)

# Classification report
print("\nCustom Classification Report (Random Forest):")
print(classification_report(y_test_custom, custom_y_pred_rf))

# Calculate AUC for Random Forest
custom_auc_rf = roc_auc_score(y_test_custom, custom_y_pred_rf)
print(f"\nCustom AUC (Random Forest): {custom_auc_rf}")




Custom Confusion Matrix (Random Forest):
[[50  0]
 [ 0  9]]

Custom Classification Report (Random Forest):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00         9

    accuracy                           1.00        59
   macro avg       1.00      1.00      1.00        59
weighted avg       1.00      1.00      1.00        59


Custom AUC (Random Forest): 1.0


### 3.5 Compare your results and comment on your findings. Which one(s) did the best job? What could have been the problem with the ones that did not work? etc.

### 1. **Logistic Regression and Naive Bayes:**
Both Logistic Regression and Naive Bayes classified all 59 test cars correctly (accuracy, precision, recall, and AUC of 1.00), even though they rest on different assumptions:
- **Logistic Regression** is a linear model: it assumes the log-odds of the target are a linear function of the features. Its perfect score indicates that the two fuel types are linearly separable in this feature space, at least on this test split.
- **Naive Bayes** assumes the features are conditionally independent given the class. That assumption almost certainly does not hold here (for example, length, width, and curb_weight are physically related), so the perfect score should not be read as evidence of independence. Naive Bayes often classifies well despite a violated assumption, as long as the correct class still receives the highest posterior probability.

### 2. **KNN:**
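KNN reached 86% accuracy, but almost entirely by predicting the majority class: it identified only 1 of the 9 diesel cars (recall 0.11, F1 0.20), and its AUC of 0.56 is barely above chance. The 86% figure mostly reflects the class imbalance, since always predicting gas would already give 50/59 ≈ 85%.
- **The Issue:** KNN is distance-based, and the features were not scaled before fitting. Large-valued columns such as price and curb_weight dominate the Euclidean distance, while small-valued columns such as bore, stroke, and comprassion contribute almost nothing. In addition, the grid search selected k by accuracy, which is a weak criterion on imbalanced data, and with few diesel cars in the training set the nearest neighbors of a diesel car are often gas cars.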
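
Because KNN compares points by distance, features measured on very different scales contribute unequally. The sketch below shows one common remedy, standardizing the features inside a `Pipeline` before the grid search over k. It runs on synthetic stand-in data (one small-scale informative column and one large-scale noisy column, both invented here), so the numbers it prints say nothing about the auto imports results.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: the informative column has small values,
# the uninformative column has very large values
rng = np.random.default_rng(0)
n = 200
y_demo = (rng.random(n) < 0.15).astype(int)
informative = np.where(y_demo == 1, 22.0, 9.0) + rng.normal(0, 1.0, n)
large_scale = rng.normal(13000, 8000, n)
X_demo = np.column_stack([informative, large_scale])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Scaling sits inside the pipeline, so each CV fold is scaled
# using statistics from its own training portion only
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
grid = GridSearchCV(pipe, {"knn__n_neighbors": list(range(1, 16))},
                    cv=5, scoring="accuracy")
grid.fit(X_tr, y_tr)
best_k = grid.best_params_["knn__n_neighbors"]

# Same k without scaling, for comparison
unscaled_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)

scaled_acc = grid.score(X_te, y_te)
unscaled_acc = unscaled_knn.score(X_te, y_te)
print("Best k:", best_k)
print("Test accuracy, scaled KNN:  ", round(scaled_acc, 3))
print("Test accuracy, unscaled KNN:", round(unscaled_acc, 3))
```

Applied to the notebook, the same pipeline could be fitted on `X_train_custom` and `y_train_custom` in place of the synthetic arrays.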

### 3. **Random Forest:**
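Random Forest, like Logistic Regression and Naive Bayes, classified every test car correctly. Tree-based models split on one feature at a time, so they are unaffected by feature scale, which is consistent with Random Forest succeeding where KNN did not. A perfect score on this problem does not by itself show that nonlinear modeling was needed, since the linear model did equally well.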

### Summary & Comparison:
- **Logistic Regression** and **Naive Bayes** both scored perfectly despite resting on different assumptions.
- **KNN** missed 8 of the 9 diesel cars, most likely because of unscaled features combined with class imbalance.
- **Random Forest** scored perfectly and is insensitive to feature scale.

### Considerations:
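- **Dataset Complexity:** The classes appear easy to separate, so even simple models reach perfect scores. The differences between models come mainly from preprocessing needs (scaling for KNN) rather than from model capacity.
- **Model Suitability:** The "best" model depends on how well its assumptions match the data. With only 59 test cars, 9 of them diesel, the perfect scores carry considerable uncertainty, and cross-validation would give a more reliable comparison.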

In summary, Logistic Regression, Naive Bayes, and Random Forest all achieved perfect scores on this test split, while KNN failed on the minority class. Since three very different models tie, the simplest and most interpretable one (Logistic Regression) is a reasonable choice here, and KNN should be re-evaluated after feature scaling before it is ruled out.

## 4. Bonus question (5 extra points)
**Try to fix the imbalanced nature of the data with a tool from the lecture. Run one of the classification methods (preferably one that "failed" before) and see if you get better results.**

In [67]:
#pip install -U scikit-learn

In [68]:
#pip install imbalanced-learn

In [69]:
from imblearn.over_sampling import SMOTE

# Oversample the minority class in the training set only; the test set stays untouched
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_custom, y_train_custom)

# Refit KNN (default n_neighbors=5) on the balanced training data
knn_classifier = KNeighborsClassifier()
knn_classifier.fit(X_train_resampled, y_train_resampled)
y_pred_resampled = knn_classifier.predict(X_test_custom)

# Report the same metrics as in section 3.3 for comparison
conf_matrix_resampled = confusion_matrix(y_test_custom, y_pred_resampled)
print("Confusion Matrix (Resampled Data):")
print(conf_matrix_resampled)
print("\nClassification Report (Resampled Data):")
print(classification_report(y_test_custom, y_pred_resampled))
auc_resampled = roc_auc_score(y_test_custom, y_pred_resampled)
print(f"\nAUC (Resampled Data): {auc_resampled}")


ImportError: cannot import name '_check_X' from 'imblearn.utils._validation' (/Applications/anaconda3/lib/python3.11/site-packages/imblearn/utils/_validation.py)
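
The cell above stopped at the ImportError, so no resampled result is available. An error of this kind often points to an imbalanced-learn release that does not match the installed scikit-learn version, although that is an assumption here and was not verified. As a possible fallback that needs only scikit-learn, the sketch below balances a training set by random oversampling with `sklearn.utils.resample`. This is a different technique from SMOTE: it duplicates existing minority rows instead of synthesizing new ones. The arrays are hypothetical placeholders, not the notebook's training data.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical training data with a 9:1 class imbalance
rng = np.random.default_rng(42)
y_train_demo = np.array([0] * 90 + [1] * 10)
X_train_demo = rng.normal(size=(100, 3))

minority_mask = y_train_demo == 1
n_majority = int((~minority_mask).sum())

# Draw minority rows with replacement until both classes have the same count
X_min_up, y_min_up = resample(
    X_train_demo[minority_mask],
    y_train_demo[minority_mask],
    replace=True,
    n_samples=n_majority,
    random_state=42,
)

# Combine the untouched majority rows with the oversampled minority rows
X_balanced = np.vstack([X_train_demo[~minority_mask], X_min_up])
y_balanced = np.concatenate([y_train_demo[~minority_mask], y_min_up])

print(np.bincount(y_balanced))  # class counts after balancing
```

As with SMOTE, only the training split should be resampled; the test split must keep its original class balance so the reported metrics stay honest.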