## Histograms/Statistics
Originally, the machine learning models were run on the filtered images rather than on statistics of those images. The baseline accuracy for this project is 33%, as there are 3 classes. Below are the results for a random forest model on the Prewitt edge-detection filtered image. The picture of the result below will be referred to as *Figure 1*.

<img src='https://drive.google.com/uc?id=121kofeA5KuPZdJcidhgCiYAzEH7ngry5' width = 600>

The accuracy is 43% on the testing data and 63% on the training data. Not only is the accuracy low, but the model is also overfitting despite parameter tuning.
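The 33% baseline quoted above holds when the three classes are roughly balanced. As a minimal sketch, assuming a hypothetical label array `y_example` (not the project's data), the baseline can be checked as the share of the most frequent class:

```python
import numpy as np

# Hypothetical labels for illustration only: 0 = Minor, 1 = Moderate, 2 = Severe
y_example = np.array([0, 1, 2] * 100)

# A classifier that always predicts the most frequent class scores this accuracy
class_counts = np.bincount(y_example)
baseline_accuracy = class_counts.max() / class_counts.sum()
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

If the real labels were imbalanced, this number would be higher than 33% and would be the fairer reference point.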

To extract features from the images of damaged cars with better accuracy and less overfitting, histograms of different edge filters are combined with statistics of each image. Histograms are taken of the original image and of the images produced by the three edge filters (Prewitt, Sobel and Scharr), each with 64 bins.

The original image and histogram of the image are shown below:


<img src='https://drive.google.com/uc?id=1MlL6p5VD0nfy91-vx3PL2JwQcajyvC4M' width = 600>
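A histogram like the one above can be computed in a few lines. This is a minimal sketch on a hypothetical random image (the array `image` stands in for a real grayscale photo and is an assumption, not the project's data):

```python
import numpy as np

# Hypothetical 128x128 grayscale image with values in [0, 1)
rng = np.random.default_rng(0)
image = rng.random((128, 128))

# Flatten the image and count how many pixels fall into each of 64 equal-width bins
hist, bin_edges = np.histogram(image.ravel(), bins=64)

print(hist.shape)       # (64,)
print(bin_edges.shape)  # (65,)
```

Because no `range` is passed, the bin edges span each image's own minimum and maximum. If bins need to cover the same intensity interval across images, passing a fixed `range` to `np.histogram` is one option.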



## Edge Detection
Edge detection is the process of detecting the edges or boundaries of objects in an image. It works by identifying sharp changes in image brightness. Detecting edges is useful for image segmentation, object detection and feature extraction. There are a variety of edge detection methods, including Sobel, Prewitt, and Scharr. Other methods exist, but this project focuses on these three.

All three are first-order derivative operators: they estimate the image gradient from differences between neighbouring pixels.
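To make the first-order derivative idea concrete, here is a small sketch that slides the standard 3x3 horizontal-derivative kernels of the three operators over a toy image with a single vertical edge. The helper `correlate_valid` and the toy image are illustrative assumptions, and the kernels are left unnormalized:

```python
import numpy as np

# Horizontal-derivative kernels (the vertical ones are the transposes)
KERNELS = {
    "prewitt": np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float),
    "sobel": np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float),
    "scharr": np.array([[3, 0, -3], [10, 0, -10], [3, 0, -3]], dtype=float),
}

def correlate_valid(image, kernel):
    """Slide a 3x3 kernel over the image (no padding) and sum the products."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)
    return out

# Toy image: dark left half, bright right half, so there is one vertical edge
image = np.zeros((6, 6))
image[:, 3:] = 1.0

responses = {name: correlate_valid(image, k) for name, k in KERNELS.items()}
for name, resp in responses.items():
    print(name, np.abs(resp).max())
```

The response is zero in the flat regions and non-zero only where the window straddles the edge. The three operators differ only in how strongly they weight the centre row relative to its neighbours.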

**Prewitt**

The Prewitt operator calculates the gradient of the image intensity at each pixel, giving the direction of the steepest change in intensity as well as the rate of that change.

The code for extracting the histograms for Prewitt is shown below:


```
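import numpy as np
import pandas as pd
from skimage.filters import prewitt

def edge_filter(dataset):
  # dataset: array of grayscale images with shape (n_images, height, width)
  edge_dataset = pd.DataFrame()
  for image in range(dataset.shape[0]):
      img_edge = dataset[image, :, :]
      fd = prewitt(img_edge)  # Prewitt edge magnitude
      hist1, bins = np.histogram(fd.ravel(), 64)  # 64-bin histogram of the edge values
      df1 = pd.DataFrame(hist1).transpose()  # one row with 64 columns
      edge_dataset = pd.concat([edge_dataset, df1], ignore_index=True)
  return edge_dataset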
```

For each image in the dataset, the function edge_filter applies a Prewitt edge filter to the image, flattens the result with ravel, and bins the values into a 64-bin histogram with np.histogram. The results are put into a dataframe in which each column is a bin (64 in this case) and each row is an image. The same process is repeated for Sobel and Scharr.
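Since the same steps are repeated for each filter, they could be folded into one function that takes the filter as an argument. The sketch below is one possible arrangement, not the project's code: `histogram_features` and the stand-in filter `horizontal_difference` are hypothetical names, and in the project the filter slot would be taken by prewitt, sobel or scharr.

```python
import numpy as np

def histogram_features(dataset, filter_fn, bins=64):
    """Apply filter_fn to every image and return an (n_images, bins) array of counts."""
    rows = []
    for img in dataset:
        filtered = filter_fn(img)
        hist, _ = np.histogram(filtered.ravel(), bins=bins)
        rows.append(hist)
    return np.vstack(rows)

# Stand-in filter for illustration: absolute horizontal intensity difference
def horizontal_difference(img):
    return np.abs(np.diff(img, axis=1))

# Hypothetical dataset: 5 random grayscale images of 32x32 pixels
rng = np.random.default_rng(0)
dataset = rng.random((5, 32, 32))

features = histogram_features(dataset, horizontal_difference)
print(features.shape)  # (5, 64)
```

Collecting the rows in a list and stacking them once also avoids calling pd.concat inside the loop, which copies the growing dataframe on every iteration.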

Below is an example of the Prewitt filter applied to an image and the histogram of the filtered image.


<img src='https://drive.google.com/uc?id=1MRYeCj_2ZcFeLbrbsWUbFA0pqgASH3XW' width = 600>


**Sobel**

Another technique for detecting edges is the Sobel operator, which is mainly used to emphasize image edges. It works by computing either the gradient vector or the norm of that vector at each point. The code to get the histograms and an example of the output are shown below for Sobel.

```
from skimage.filters import sobel

def sobel_filter(dataset):
  sobel_dataset = pd.DataFrame()
  for image in range(dataset.shape[0]):
      img_c = dataset[image, :, :]
      fd_sobel = sobel(img_c)  # Sobel edge magnitude
      hist_sobel, bins = np.histogram(fd_sobel.ravel(), 64)  # 64-bin histogram
      df_sobel = pd.DataFrame(hist_sobel).transpose()  # one row with 64 columns
      sobel_dataset = pd.concat([sobel_dataset, df_sobel], ignore_index=True)
  return sobel_dataset
```



<img src='https://drive.google.com/uc?id=1qElJP6CNMVE27eW2Y8otDpjbYexMbn5k' width = 600>


**Scharr**

The Scharr filter highlights gradient edges along the x-axis and y-axis independently. Its performance is similar to that of Sobel, but its kernel weights are chosen for better rotational symmetry, so it can give more accurate gradient estimates where Sobel falls short. The code to get the histograms and an example of the output are shown below for Scharr.

```
from skimage.filters import scharr

def scharr_filter(dataset):
  scharr_dataset = pd.DataFrame()
  for image in range(dataset.shape[0]):
      img_c = dataset[image, :, :]
      fd_scharr = scharr(img_c)  # Scharr edge magnitude
      hist_scharr, bins = np.histogram(fd_scharr.ravel(), 64)  # 64-bin histogram
      df_scharr = pd.DataFrame(hist_scharr).transpose()  # one row with 64 columns
      scharr_dataset = pd.concat([scharr_dataset, df_scharr], ignore_index=True)
  return scharr_dataset
```




<img src='https://drive.google.com/uc?id=19hVgnEepQLXtYL9pHsFKURAwRV0k2gZ8' width = 600>

In all of these images we can see the edges of the damage to the front of the car, as well as the differences in distribution between the original image and the edge-filtered image. The histograms of each of these images are concatenated so that there is one long row for each image. After all the histograms and statistics were concatenated, we end up with a dataframe with more than 300 columns. Running models on the entire dataset would cause overfitting, so we used feature selection to pick the best features. The dataframe with all the columns is shown below:

<img src='https://drive.google.com/uc?id=1nUKLs3-uvEWUUuGo4Zm44YC5zD_A5fno' width = 650>
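The concatenation step can be sketched as follows. The random blocks and the column naming scheme (`sobel_bin_12` and so on) are assumptions for illustration; the project's actual column names may differ. Four 64-bin histograms account for 4 × 64 = 256 columns, which suggests the remaining columns come from the image statistics.

```python
import numpy as np

# Hypothetical per-filter histogram blocks for 5 images, 64 bins each
rng = np.random.default_rng(0)
blocks = {
    "original": rng.integers(0, 100, size=(5, 64)),
    "prewitt": rng.integers(0, 100, size=(5, 64)),
    "sobel": rng.integers(0, 100, size=(5, 64)),
    "scharr": rng.integers(0, 100, size=(5, 64)),
}

# Place the blocks side by side so each image becomes one long row
X_hist = np.hstack(list(blocks.values()))

# Name each column after its source and bin index, e.g. "sobel_bin_12"
column_names = [f"{name}_bin_{i}" for name in blocks for i in range(64)]

print(X_hist.shape)      # (5, 256)
print(column_names[64])  # prewitt_bin_0
```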

**Random Forest Feature Importance**

Selecting the right features is beneficial because it gives a better understanding of the model and helps us focus on the most important features. The random forest model has built-in feature importance scores that can be used to rank the features in this dataset; for this project, the top 40 features are used. A RandomForestClassifier is fit on X_train and y_train, and the importance of each feature is calculated. The features are then sorted from most to least important and the top 40 are taken.




```
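from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=123, shuffle=True)
# Fit a depth-limited forest and read its impurity-based importances
model = RandomForestClassifier(criterion='entropy', random_state=42, max_depth=6)
model.fit(X_train, y_train)
feature_importance = pd.Series(model.feature_importances_, index=X.columns)
sorted_imp = feature_importance.sort_values(ascending=False)
colums_use = sorted_imp.index[0:40]  # names of the 40 most important features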
```
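Once the top 40 column names are known, the same columns have to be kept in both the training and the testing split. The sketch below shows this end to end on synthetic data from `make_classification`; the data, the `feature_i` column names and the variables `top_columns`, `X_train_selected` and `X_test_selected` are assumptions for illustration, not the project's own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature table: 300 samples, 100 features, 3 classes
X_arr, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                               n_classes=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(100)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=123, shuffle=True)

# Rank features with a forest fit on the training split only
model = RandomForestClassifier(criterion="entropy", random_state=42, max_depth=6)
model.fit(X_train, y_train)
feature_importance = pd.Series(model.feature_importances_, index=X.columns)
top_columns = feature_importance.sort_values(ascending=False).index[:40]

# Keep the same 40 columns in both splits
X_train_selected = X_train[top_columns]
X_test_selected = X_test[top_columns]
print(X_train_selected.shape[1], X_test_selected.shape[1])
```

Fitting the ranking model on the training split only, as the original code does, keeps information from the test set out of the feature selection.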



From the bar plot below, we see that statistical features such as Sum_Entropy, mean_entropy and mean_roberts have the highest feature importance. Histogram bins from Sobel and Scharr are also present.

<img src='https://drive.google.com/uc?id=11hkcrUd0Fmw7jRuPcpL-60LmRPr2XrMp' width = 680 height = 500>


Four different machine learning models were run on the dataset with the selected features.

**Random Forest**

GridSearchCV was used to find the parameters that give the best accuracy. The parameters fed into the grid search were criterion, max depth, max features, minimum samples split, minimum samples leaf and oob_score. Without tuning, the model will overfit, so it needs to be regularized. For a random forest, decreasing the max depth and increasing the minimum samples per leaf reduce overfitting. The model is fit on X_train and y_train and used to predict on X_test.



```
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': range(1, 6),
    'criterion': ['gini', 'entropy'],
    'max_features': ['sqrt', 'log2', None],
    'min_samples_split': [2, 3, 4, 5],
    'min_samples_leaf': [2, 3, 4, 5, 6],
    'oob_score': [True, False]
}

# 5-fold cross-validated search over every combination in param_grid
rf = RandomForestClassifier()
cv_rf = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5)
cv_rf.fit(X_train, y_train)
cv_rf.best_params_

# Refit with the best parameters found and predict on the held-out test set
rf = RandomForestClassifier(criterion='entropy', max_depth=5, max_features=None, min_samples_split=5, min_samples_leaf=6, oob_score=True)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)
```
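To get a feel for the cost of this search, the number of candidates in the grid can be counted before running it. This is a small sketch using only the standard library; the grid values are copied from the search above and `n_folds` matches cv=5.

```python
from itertools import product

# Same grid as in the search, written out so the candidates can be counted
param_grid = {
    "max_depth": range(1, 6),
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2", None],
    "min_samples_split": [2, 3, 4, 5],
    "min_samples_leaf": [2, 3, 4, 5, 6],
    "oob_score": [True, False],
}

candidates = list(product(*param_grid.values()))
n_folds = 5
print(len(candidates))            # 1200 parameter combinations
print(len(candidates) * n_folds)  # 6000 model fits during the search
```

Since oob_score only controls whether an out-of-bag estimate is computed, it could probably be dropped from the grid, which would halve the number of fits.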




<img src='https://drive.google.com/uc?id=1PvQPp5w3gzLA3PJQWjo_Vl5Hj2UZIuEf' width = 650>

There was an improvement in accuracy for the random forest with feature selection on Histograms/Statistics compared to the random forest model on the Prewitt image (*Figure 1*). The accuracy on the testing data is 54%, an increase of 11 percentage points over the 43% in *Figure 1*. Additionally, the training accuracy is 59%, indicating that there is not much overfitting. The F1 scores for label 0 (Minor) and label 2 (Severe) are 0.63 and 0.59 respectively. This shows that the model predicts these two classes reasonably well, but has trouble predicting class 1 (Moderate).
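For reference, the per-class F1 scores reported above combine precision and recall for each label. The sketch below computes them by hand on made-up labels; the helper `per_class_f1` and the arrays `y_true` and `y_pred` are illustrative assumptions and do not reproduce the project's results.

```python
import numpy as np

def per_class_f1(y_true, y_pred, n_classes=3):
    """Return precision, recall and F1 for each class label 0..n_classes-1."""
    scores = {}
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[c] = (precision, recall, f1)
    return scores

# Made-up labels for illustration (0 = Minor, 1 = Moderate, 2 = Severe)
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 2, 0, 1, 2, 2, 2, 2])

scores = per_class_f1(y_true, y_pred)
for label, (p, r, f1) in scores.items():
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy example class 1 has perfect precision but low recall, so its F1 score is the lowest of the three, the same kind of pattern the report describes for the Moderate class.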





**Logistic Regression**

The next model is a Logistic Regression model tuned with GridSearchCV over the parameters max_iter, solver and penalty. The model was fit on X_train and y_train and then used to predict on X_test, and the classification report was printed.



```
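from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Not every solver supports every penalty; by default GridSearchCV scores
# combinations that fail to fit as NaN (with a warning) and continues
param_grid = [
    {'penalty': ['l1', 'l2', 'elasticnet', 'none'],
     'solver': ['lbfgs', 'newton-cg', 'liblinear', 'sag', 'saga'],
     'max_iter': [100, 1000, 2500, 5000]
    }
]

log_grid = LogisticRegression()
cv = GridSearchCV(estimator=log_grid, param_grid=param_grid, cv=5)
cv.fit(X_train, y_train)

cv.best_params_

# Refit with the best parameters found and predict on the held-out test set
log = LogisticRegression(max_iter=2500, solver='lbfgs', penalty='l2')
log.fit(X_train, y_train)
pred = log.predict(X_test)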
```




<img src='https://drive.google.com/uc?id=1ZDplB73HJIaO0D9SUDR7nUorwTpWrm4Y' width = 550 height= 500 text-align=center>

The Logistic Regression model performs similarly to the Random Forest model, with a testing accuracy of 54%. After parameter tuning it shows almost no overfitting, with a training accuracy of 54%. The F1 score is higher for classes 0 and 2, but lower for class 1.

**Naive Bayes**

Naive Bayes, a machine learning model used for classification tasks, assumes conditional independence between every pair of features given the target variable. GaussianNB takes few parameters, so GridSearchCV was not used for this model.



```
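from sklearn.naive_bayes import GaussianNB

# Fit with default parameters and predict on the held-out test set
nb = GaussianNB()
nb.fit(X_train, y_train)
pred = nb.predict(X_test)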
```
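For intuition about the independence assumption, the following is a simplified from-scratch sketch of a Gaussian Naive Bayes classifier. It is not scikit-learn's implementation; the functions `fit_gaussian_nb` and `predict_gaussian_nb` and the toy data are hypothetical and only illustrate the idea.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate a prior plus a per-feature mean and variance for every class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return params

def predict_gaussian_nb(params, X):
    """Pick the class with the highest log prior + summed per-feature log likelihood."""
    classes = sorted(params)
    log_posts = []
    for c in classes:
        prior, mean, var = params[c]
        # Independence assumption: the joint likelihood is a product over features,
        # which becomes a sum of per-feature Gaussian log densities
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var)
        log_posts.append(np.log(prior) + log_lik.sum(axis=1))
    return np.array(classes)[np.argmax(np.vstack(log_posts), axis=0)]

# Toy data: three well-separated clusters in 4 dimensions (not the project's data)
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(loc=m, scale=1.0, size=(50, 4)) for m in (0, 10, 20)])
y_toy = np.repeat([0, 1, 2], 50)

pred_toy = predict_gaussian_nb(fit_gaussian_nb(X_toy, y_toy), X_toy)
print((pred_toy == y_toy).mean())
```

Histogram bins taken from the same image are usually correlated with one another, which is one plausible reason the independence assumption fits this dataset poorly.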


<img src='https://drive.google.com/uc?id=1yOanKdceFKZpADCaE6gulVq2F_bhRB_F' width = 600 height = 600>

Despite not overfitting, the Naive Bayes model performs worse than Random Forest and Logistic Regression. However, it still performs better than the results in *Figure 1*.


**SVM**

SVM, or Support Vector Machine, is a supervised machine learning model that works well with smaller but complex datasets and is effective in high-dimensional spaces. GridSearchCV is used with parameters C and gamma. The best parameters are put into the SVM model, which is fit on X_train and y_train.

```
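from sklearn import svm
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}

# 5-fold cross-validated search over C and gamma for an SVC
svm_grid = svm.SVC()
cv = GridSearchCV(estimator=svm_grid, param_grid=param_grid, cv=5)
cv.fit(X_train, y_train)

cv.best_params_

# Refit with the best parameters found and predict on the held-out test set
clf = svm.SVC(C=1000, gamma=0.01)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)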
```


<img src='https://drive.google.com/uc?id=13ExS5BLDYbxg_4GXL031NUHjjipRCDi0' width = 600 height = 600>

The results of the SVM model are similar to those of the Random Forest model. Accuracy is 54%, with no overfitting (training accuracy is 55%). The F1 score, precision and recall are higher for labels 0 and 2, while the F1 score and recall are significantly lower for label 1.


Below is the summary of accuracy results for each model with this method.

<img src='https://drive.google.com/uc?id=1ZJqr3AepwtNmSS4U3IvLjQiBnb7VQVdI' width = 500 height = 450>

