# <center> <h1>Project Guidelines</h1> </center>
<center> <h1>EIN 4933/6935 Python for Data Science Summer 2020</h1> </center>

Two major types of problems that machine learning algorithms try to solve are:

**Regression:** Predict a continuous value for a given data point<br/>
**Classification:** Predict the class of a given data point<br/>

In your project, choose one type of problem and make sure to follow the steps listed below:

**Step-1: Problem Definition:**<br/>
Choose a problem for which a dataset is available. The problem should be either a regression or a classification problem. You can choose one of the datasets listed below, or another dataset that you are familiar with. You must write a summary of the problem and the associated dataset.<br/>
**Step-2: Data Cleaning and Preprocessing:**<br/>
In this step, you must identify the relevant columns in the dataset that can be used as predictors. Any irrelevant columns must be excluded from the analysis. If there are any missing values in the dataset, they should be replaced through a data imputation method. Any missing rows should be deleted from the dataset. Each column must be converted to an appropriate data type. The data types can be found in the data info files or determined by inspection.<br/>
**Step-3: Feature Extraction/Addition (optional):**<br/>
In this step, additional features can be added to the existing dataset, or existing features can be altered so that they better serve for training the underlying model. You can skip this step if you think there is no room for extracting or adding features.<br/>
**Step-4: Data Scaling (optional):**<br/>
Transform your data so that it fits within a specific scale, for example through standardization or normalization. You can skip this step if the features in your dataset do not vary widely in magnitude, units and range.<br/>
**Step-5: Data Splitting:**<br/>
This step should start with creating two dataframes: response and features. Split these dataframes into train and test parts.<br/>
**Step-6: Model Selection, Model Fitting and Model Evaluation:**<br/>
Choose an appropriate model for your problem. You can choose a model that is listed under the matching problem group below, or another model that you are familiar with. Fit your model to the train data. Generate predictions over the test data and then evaluate your model by reporting appropriate accuracy metrics.<br/>
**Step-7: Report Feature Importance:**<br/>
You must report the feature importance results.<br/>
**Step-8: Improve Your Results:**<br/>
Choose one or more methods to improve the baseline results that you reported in the previous steps. For example, you can enumerate multiple models and recommend the one that gives the best accuracy metrics. Or you can develop a feature elimination strategy that yields better results. You are also free to choose another method that you are familiar with.<br/>
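
As an illustration of Step-2, here is a minimal sketch on a small made-up table. The column names (`id`, `age`, `weight`, `target`) and the values are hypothetical, and dropping rows whose response is missing is only one possible reading of the step, not a requirement of the guidelines.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; the column names are illustrative only.
raw = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],                     # identifier, not a predictor
    "age": ["25", "31", None, "40", "28"],     # numbers stored as text
    "weight": [180.0, np.nan, 205.0, 231.0, 190.0],
    "target": [7.7, 6.8, np.nan, 1.1, 5.0],    # response
})

# 1) Exclude columns that are irrelevant for prediction.
clean = raw.drop(columns=["id"])

# 2) Assumption: rows without a response are dropped rather than imputed.
clean = clean.dropna(subset=["target"]).reset_index(drop=True)

# 3) Convert each column to an appropriate data type.
clean["age"] = pd.to_numeric(clean["age"])

# 4) Impute the remaining gaps in the predictors with the column mean.
predictors = ["age", "weight"]
imputer = SimpleImputer(strategy="mean")
clean[predictors] = imputer.fit_transform(clean[predictors])

print(clean.dtypes)
print(clean.isnull().sum())
```

Other imputation strategies (for example `strategy="median"`) may suit skewed columns better; the mean is used here only to keep the sketch short.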

## Regression Problem
Models to consider:<br/>
**Multiple Linear Regression**<br/>
**Decision Trees**<br/>
**Random Forest**<br/>
**Support Vector Machines**<br/>
Any other model that can be used for a regression problem.<br/>

#### Example Data Sets for Regression Problem:
**1) NBA:** Predict the salary of an NBA player.<br/>
Download link: https://sites.google.com/site/yasinunlu/home/research/new1/nba.csv<br/>
**2) Automobile Data Set:** Predict car prices.<br/>
Download link: https://sites.google.com/site/yasinunlu/home/research/new1/AutomobileDataSet.xlsx<br/>
More info: https://sites.google.com/site/yasinunlu/home/research/new1/auto_prices_info.txt<br/>
**3) Auto MPG:** Predict the miles per gallon (mpg) of an automobile.<br/>
Download link: https://sites.google.com/site/yasinunlu/home/research/new1/mpg.zip

## Classification Problem
Models to consider:<br/>
**Logistic Regression**<br/>
**Decision Trees**<br/>
**Random Forest**<br/>
**Nearest Neighbor**<br/>
**Support Vector Machines**<br/>
**Naïve Bayes**<br/>
Any other model that can be used for a classification problem.<br/>

#### Example Data Sets for Classification Problem:
**1) Titanic:** Predict whether a passenger survived.<br/> 
Download link: https://sites.google.com/site/yasinunlu/home/research/new1/Titanic_train.csv<br/>
**2) Iris:** Predict species name.<br/>
Download link: https://sites.google.com/site/yasinunlu/home/research/new1/iris.csv<br/>
**3) Adult:** Predict whether a person makes over 50K a year.<br/>
Download link: https://sites.google.com/site/yasinunlu/home/research/new1/adult.zip<br/>

In [6]:
#Importing the data and changing College to a true/false column to see if they have a college degree or not.


import pandas as pd
NBA = pd.read_csv('https://sites.google.com/site/yasinunlu/home/research/new1/nba.csv')
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions do not guarantee will modify NBA itself.
NBA["College"] = NBA["College"].fillna("None")

nba_x = pd.DataFrame(NBA)
# True when a college is listed, False when the entry was missing
nba_x["CollegeDegree"] = nba_x["College"] != "None"

NBA_new = nba_x[['Name','Team','Number','Position','Age','Height','Weight','CollegeDegree','Salary']]
NBA_new.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,CollegeDegree,Salary
0,Avery Bradley,Boston Celtics,0,PG,25,6-2,180,True,7730337.0
1,Jae Crowder,Boston Celtics,99,SF,25,6-6,235,True,6796117.0
2,John Holland,Boston Celtics,30,SG,27,6-5,205,True,
3,R.J. Hunter,Boston Celtics,28,SG,22,6-5,185,True,1148640.0
4,Jonas Jerebko,Boston Celtics,8,PF,29,6-10,231,False,5000000.0


In [7]:
#2 Checking for Null Values

NBA_new.isnull().sum()

Name              0
Team              0
Number            0
Position          0
Age               0
Height            0
Weight            0
CollegeDegree     0
Salary           11
dtype: int64

In [8]:
#2 Replacing Null values in Salary with the mean

import numpy as np
from sklearn.impute import SimpleImputer

# NBA_new was created by selecting columns from nba_x, so take an explicit copy
# first; assigning into the selection directly raises a SettingWithCopyWarning.
NBA_new = NBA_new.copy()

# np.nan is the portable spelling (the np.NaN alias was removed in NumPy 2.0)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
NBA_new['Salary'] = imputer.fit_transform(NBA_new[['Salary']]).ravel()


In [9]:
#Checking again for Null values after imputer method has been applied
NBA_new.isnull().sum()

Name             0
Team             0
Number           0
Position         0
Age              0
Height           0
Weight           0
CollegeDegree    0
Salary           0
dtype: int64

In [10]:
#2 Dropping columns that I don't need

NBA_new = NBA_new.drop(['Name', 'Number'], axis = 1)
NBA_new

Unnamed: 0,Team,Position,Age,Height,Weight,CollegeDegree,Salary
0,Boston Celtics,PG,25,6-2,180,True,7.730337e+06
1,Boston Celtics,SF,25,6-6,235,True,6.796117e+06
2,Boston Celtics,SG,27,6-5,205,True,4.842684e+06
3,Boston Celtics,SG,22,6-5,185,True,1.148640e+06
4,Boston Celtics,PF,29,6-10,231,False,5.000000e+06
...,...,...,...,...,...,...,...
452,Utah Jazz,PF,20,6-10,234,True,2.239800e+06
453,Utah Jazz,PG,26,6-3,203,True,2.433333e+06
454,Utah Jazz,PG,24,6-1,179,False,9.000000e+05
455,Utah Jazz,C,26,7-3,256,False,2.900000e+06


In [11]:
#2 Converting height column to inches 
newheight = []

for element in NBA_new['Height']:
    k = str(element) #create a string out of object
    ft = k[0]
    inch = k[2:]
    feetInches = 12*int(ft) + int(inch)
    newheight.append(feetInches)
  
NBA_new["Height"] = newheight
NBA_new.head()

Unnamed: 0,Team,Position,Age,Height,Weight,CollegeDegree,Salary
0,Boston Celtics,PG,25,74,180,True,7730337.0
1,Boston Celtics,SF,25,78,235,True,6796117.0
2,Boston Celtics,SG,27,77,205,True,4842684.0
3,Boston Celtics,SG,22,77,185,True,1148640.0
4,Boston Celtics,PF,29,82,231,False,5000000.0
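
The loop above reads the feet value from the first character of each string. A vectorized alternative that splits on the dash instead is sketched below, under the assumption that every value has the form `feet-inches`. The `heights` series is made-up toy data, not the real column.

```python
import pandas as pd

# Toy heights in the same 'feet-inches' text format as the Height column.
heights = pd.Series(["6-2", "6-10", "7-3", "5-11"], name="Height")

# Split on the dash, cast both parts to integers, then combine into inches.
parts = heights.str.split("-", expand=True).astype(int)
height_in = 12 * parts[0] + parts[1]

print(height_in.tolist())  # [74, 82, 87, 71]
```

Splitting on the separator does not depend on the feet value being a single character, and the cast to `int` fails loudly if a value does not match the expected format.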


In [12]:
#Converting Categorical Data into integers

from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
NBA_new['Team'] = LE.fit_transform(NBA_new['Team'])
NBA_new['Position'] = LE.fit_transform(NBA_new['Position'])
NBA_new['CollegeDegree'] = LE.fit_transform(NBA_new['CollegeDegree'])

NBA_new.head()

Unnamed: 0,Team,Position,Age,Height,Weight,CollegeDegree,Salary
0,1,2,25,74,180,1,7730337.0
1,1,3,25,78,235,1,6796117.0
2,1,4,27,77,205,1,4842684.0
3,1,4,22,77,185,1,1148640.0
4,1,1,29,82,231,0,5000000.0
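
`LabelEncoder` maps each category to an integer, which a linear model then treats as an ordered numeric value (for example, team 2 being "twice" team 1). For nominal columns such as Team and Position, one-hot encoding is a common alternative. The sketch below uses a small made-up frame and `pandas.get_dummies`; it is an illustration, not the method used in this notebook.

```python
import pandas as pd

# Toy frame with two nominal columns and one numeric column.
toy = pd.DataFrame({
    "Team": ["Boston Celtics", "Utah Jazz", "Boston Celtics"],
    "Position": ["PG", "C", "SF"],
    "Age": [25, 26, 25],
})

# One indicator column per category; drop_first removes one level per
# column so the indicators are not perfectly collinear.
encoded = pd.get_dummies(toy, columns=["Team", "Position"],
                         drop_first=True, dtype=int)

print(encoded.columns.tolist())
# ['Age', 'Team_Utah Jazz', 'Position_PG', 'Position_SF']
```

The trade-off is a wider feature matrix: a column with many categories, such as Team, adds one indicator column per category (minus one when `drop_first=True`).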


In [13]:
#4 Normalizing the Data

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler() 
df = scaler.fit_transform(NBA_new)
NBA_normalized = pd.DataFrame(df, columns = NBA_new.columns.to_list())
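
The cell above fits `MinMaxScaler` on every row, including rows that later end up in the test set, and it rescales Salary as well. A frequently recommended alternative is to split first and fit the scaler on the training features only. The sketch below shows that order on synthetic data; the arrays `X` and `y` are hypothetical stand-ins for the features and the response.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 100 rows, 3 features on a large scale.
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

# Split before scaling so the test rows stay unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test rows

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

With this order, scaled test values can fall slightly outside [0, 1], which is expected. Leaving the response unscaled also keeps the error metrics in the original units.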


In [14]:
#5 Data Splitting

response = NBA_normalized[['Salary']]
feature = NBA_normalized.iloc[:,0:6]

from sklearn.model_selection import train_test_split
# train_test_split returns the parts in this order:
# features (train, test), then response (train, test)
feature_train, feature_test, response_train, response_test = train_test_split(
    feature, response, test_size=0.2, random_state=2)



In [21]:
#6 Model Selection
#Running multiple regression models

import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Model objects and their display names, in matching order
models_list = [RandomForestRegressor(),
               DecisionTreeRegressor(),
               LinearRegression()]

model_names = ['Random Forest',
               'Decision Tree',
               'Multiple Linear Regression']

mae_list = []
mse_list = []
rmse_list = []

# The response is a one-column DataFrame; flatten it to 1-D so that
# scikit-learn does not emit a DataConversionWarning when fitting.
response_train_1d = response_train.values.ravel()

for regressor in models_list:
    regressor.fit(feature_train, response_train_1d)
    response_pred = regressor.predict(feature_test)
    # scikit-learn metrics take the true values first, then the predictions
    mse = metrics.mean_squared_error(response_test, response_pred)
    mae_list.append(metrics.mean_absolute_error(response_test, response_pred))
    mse_list.append(mse)
    rmse_list.append(np.sqrt(mse))

result_dict = {'Model Name':model_names,
               'Mean Absolute Error':mae_list,
               'Mean Squared Error':mse_list,
               'Root Mean Squared Error':rmse_list}


In [22]:
#Results
#Multiple Linear Regression had the lowest error on all three metrics

Results = pd.DataFrame(result_dict)
Results 

Unnamed: 0,Model Name,Mean Absolute Error,Mean Squared Error,Root Mean Squared Error
0,Random Forest,0.152407,0.046451,0.215524
1,Decision Tree,0.192586,0.079984,0.282815
2,Multiple Linear Regression,0.130149,0.034919,0.186867
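
The comparison above rests on a single train/test split, so the ranking may shift with a different `random_state`. K-fold cross-validation is one way to get a steadier comparison. The sketch below uses synthetic data rather than the NBA table; `X`, `y`, and the model settings (`n_estimators=50`, the seeds) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))
y = X @ np.array([0.3, -0.2, 0.1, 0.0]) + rng.normal(scale=0.05, size=200)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Multiple Linear Regression": LinearRegression(),
}

# The same shuffled 5-fold split is reused for every model.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mae = {}
for name, model in models.items():
    # scikit-learn reports errors as negative scores, so flip the sign.
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_mean_absolute_error")
    cv_mae[name] = -scores.mean()

for name, mae in cv_mae.items():
    print(f"{name}: {mae:.4f}")
```

Fixing `random_state` on the tree-based models also makes the reported numbers reproducible from run to run.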


In [23]:
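#7 Reporting Feature importance
# The coefficients of the linear model are used as importance scores. All features
# are min-max scaled, so the absolute values are comparable across features;
# the sign gives the direction of the effect.

from sklearn.linear_model import LinearRegression
LR = LinearRegression()
LR.fit(feature_train, response_train)

# response_train is a one-column DataFrame, so coef_ has shape (1, n_features)
importance = LR.coef_[0]
feature_names = NBA_normalized.columns.to_list()
feature_names.remove('Salary')
# summarize feature importance
for i, score in enumerate(importance):
    print('%s: %.5f' % (feature_names[i], score))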

Team: -0.03360
Position: -0.02423
Age: 0.12669
Height: -0.18599
Weight: 0.30147
CollegeDegree: -0.02199
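
Coefficients are specific to linear models. A model-agnostic way to report importance is permutation importance, which measures how much the test score drops when one feature is shuffled. The sketch below uses `sklearn.inspection.permutation_importance` on synthetic data; the feature names `strong`, `weak`, and `noise` are made up for illustration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the third feature has no effect on the response.
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)
names = ["strong", "weak", "noise"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)
model = LinearRegression().fit(X_train, y_train)

# Shuffle each feature 10 times on the test set and record the score drop.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)

# Order the features from most to least important.
order = np.argsort(result.importances_mean)[::-1]
ranking = [names[i] for i in order]
for i in order:
    print(f"{names[i]}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

The same call works for the random forest and decision tree models, which makes the importance scores comparable across model types.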


In [24]:
#8 Improving my results. Since I already ran all of the models for comparison,
# I'm going to drop CollegeDegree, the feature with the smallest absolute coefficient,
# to see if that helps the model.

NBA_Best = NBA_normalized.drop(['CollegeDegree'], axis = 1)
NBA_Best


Unnamed: 0,Team,Position,Age,Height,Weight,Salary
0,0.034483,0.50,0.285714,0.277778,0.130137,0.308359
1,0.034483,0.75,0.285714,0.500000,0.506849,0.270944
2,0.034483,1.00,0.380952,0.444444,0.301370,0.192710
3,0.034483,1.00,0.142857,0.444444,0.164384,0.044765
4,0.034483,0.25,0.476190,0.722222,0.479452,0.199010
...,...,...,...,...,...,...
452,0.965517,0.25,0.047619,0.722222,0.500000,0.088466
453,0.965517,0.50,0.333333,0.333333,0.287671,0.096217
454,0.965517,0.50,0.238095,0.222222,0.123288,0.034807
455,0.965517,0.00,0.333333,1.000000,0.650685,0.114906


In [25]:
response_best = NBA_Best[['Salary']]
feature_best = NBA_Best.iloc[:,0:5]

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Same test_size and random_state as before, so the same rows land in each part
feature_train, feature_test, response_train, response_test = train_test_split(
    feature_best, response_best, test_size=0.2, random_state=2)

LR_best = LinearRegression()
LR_best.fit(feature_train, response_train)
response_prediction_new = LR_best.predict(feature_test)


In [26]:
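# Reporting results from the updated dataset
# Note: compared with the baseline Multiple Linear Regression (MAE 0.130149, RMSE 0.186867),
# the errors below are marginally higher, so dropping CollegeDegree did not improve the fit.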
import numpy as np
from sklearn import metrics
print('\033[1m' + 'Best Model Errors') 
print('\033[0m')
print('Mean Absolute Error:', metrics.mean_absolute_error(response_test, response_prediction_new))
print('Mean Squared Error:', metrics.mean_squared_error(response_test, response_prediction_new))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(response_test, response_prediction_new)))


Best Model Errors

Mean Absolute Error: 0.13059314879569325
Mean Squared Error: 0.035268427117270414
Root Mean Squared Error: 0.18779890073498942
