<a href="https://colab.research.google.com/github/gtoubian/cce/blob/main/Statistical_Modelling_Workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Statistical Modelling

This workshop is more open-ended than previous workshops: you will use different models to classify several datasets. For each of the datasets below, use Logistic Regression, Decision Trees, and any clustering algorithm to solve the given classification problem. For each model, report the accuracy score and plot an ROC curve to show how well it did. For each dataset, report which model classified best according to these validation metrics. Try to optimize your models by using an appropriate train/test split, a sampling technique (seen in Module 5), and any other preprocessing technique.
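As a starting point, the comparison described above might look like the sketch below. It uses synthetic data from `make_classification` as a stand-in for a real dataset, and the hyperparameters (`max_iter=1000`, `max_depth=5`) are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    # ROC curves need a score per sample, so use the positive-class probability.
    scores = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    results[name] = {"accuracy": accuracy, "auc": auc(fpr, tpr), "fpr": fpr, "tpr": tpr}
    print(f"{name}: accuracy={accuracy:.3f}, AUC={results[name]['auc']:.3f}")
```

To draw the curves, `plt.plot(r["fpr"], r["tpr"], label=name)` for each entry in `results`, plus a diagonal reference line from (0, 0) to (1, 1), is usually enough.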

##Income Classification

https://www.kaggle.com/lodetomasi1995/income-classification

This data set contains background details on several individuals. Using this information, classify which income bracket each individual belongs to.

##Churn Modelling

https://www.kaggle.com/shrutimechlearn/churn-modelling

This data set contains details of a bank's customers. The target is a binary variable indicating whether the customer left the bank (closed their account) or remains a customer.
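Binary targets like this one are often imbalanced, although the actual class ratio of this dataset is not checked here. One possible sampling technique (not necessarily the one covered in Module 5) is random oversampling of the minority class, applied to the training split only. The sketch below uses a made-up training set with roughly 20% positives.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Hypothetical imbalanced training set: roughly 20% positives (assumed, not measured).
X_train = rng.normal(size=(500, 4))
y_train = (rng.random(500) < 0.2).astype(int)

minority_mask = y_train == 1
X_min, X_maj = X_train[minority_mask], X_train[~minority_mask]

# Sample minority rows with replacement until both classes are the same size.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_balanced = np.vstack([X_maj, X_min_up])
y_balanced = np.concatenate(
    [np.zeros(len(X_maj), dtype=int), np.ones(len(X_min_up), dtype=int)]
)
print(np.bincount(y_train), "->", np.bincount(y_balanced))
```

The test split should be left untouched so that the reported accuracy and ROC curve reflect the original class balance.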

##Mobile Price Classification

https://www.kaggle.com/iabhishekofficial/mobile-price-classification

Bob has started his own mobile company. He wants to give a tough fight to big companies like Apple, Samsung, etc.

He does not know how to estimate the price of the mobiles his company creates. In this competitive mobile phone market you cannot simply assume things. To solve this problem he collects sales data of mobile phones from various companies.

Bob wants to find some relation between the features of a mobile phone (e.g. RAM, internal memory) and its selling price. But he is not so good at machine learning, so he needs your help to solve this problem.

In this problem you do not have to predict the actual price, but a price range indicating how high the price is.

**NOTE:** You will have to do some research on mobile phone prices and their specs.
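Because the target here is a price range rather than a yes/no label, accuracy still applies directly, but ROC analysis has to be extended to more than two classes. One common option is a one-vs-rest macro-averaged AUC. The sketch below uses synthetic data, and the choice of four classes and all hyperparameters are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for the price-range target (assumption).
X, y = make_classification(
    n_samples=1200, n_features=12, n_informative=6, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

tree = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X_train, y_train)

accuracy = accuracy_score(y_test, tree.predict(X_test))
# One-vs-rest AUC: each class is scored against all others, then averaged.
proba = tree.predict_proba(X_test)
macro_auc = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")
print(f"accuracy={accuracy:.3f}, macro OvR AUC={macro_auc:.3f}")
```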

In [69]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import io

In [70]:
#INCOME CLASSIFICATION:
from google.colab import files
data = files.upload()

Saving income.csv to income (3).csv


In [71]:
df = pd.read_csv(io.BytesIO(data['income.csv']))

In [72]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [73]:
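# Take an explicit copy so that adding columns to df1 later does not trigger SettingWithCopyWarning.
df1 = df[['age', ' workclass', ' fnlwgt', ' education', ' education-num', ' marital-status', ' occupation', ' relationship', ' race', ' sex', ' capital-gain', ' capital-loss', ' hours-per-week', ' native-country']].copy()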

In [74]:
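def encode(name):
  # Convert the column to a pandas categorical and store its integer codes in df1.
  # Headers in this CSV start with a space, so the new column is named e.g. ' income_cat'.
  df[name] = df[name].astype('category')
  df1[f"{name}_cat"] = df[name].cat.codes

# Encode the target and every text-valued feature.
for i in [' income', ' education', ' workclass', ' marital-status', ' occupation', ' relationship', ' race', ' sex', ' native-country']:
  encode(i)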

In [75]:
# Same encoding as above, but stored under names without the leading space
# (e.g. 'income_cat'), which are the names the later cells use.
for col in [" workclass", " income", " marital-status", " occupation", " relationship", " race", " sex", " native-country", " education"]:
  df[col] = df[col].astype('category')
  df1[f"{col.strip()}_cat"] = df[col].cat.codes
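Integer category codes are compact, but a linear model such as Logistic Regression treats them as ordered numbers, which nominal columns like workclass are not. One-hot encoding with `pd.get_dummies` is a common alternative. The sketch below contrasts the two on a tiny hand-made frame whose values are illustrative only.

```python
import pandas as pd

# Tiny hand-made frame mimicking two of the columns (values are illustrative only).
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
})

# Integer codes: compact, but implies Private < Self-emp-not-inc < State-gov.
codes = toy["workclass"].astype("category").cat.codes

# One-hot: one 0/1 indicator column per category, with no implied order.
one_hot = pd.get_dummies(toy, columns=["workclass"], dtype=int)

print(codes.tolist())
print(one_hot.columns.tolist())
```

Decision trees are less sensitive to this choice, since they split on thresholds rather than fitting a coefficient per column.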

In [97]:
df1 = df1.drop(columns = [" workclass", " marital-status", " occupation", " relationship", " race", " sex", " native-country", " education"])  # drop() returns a new frame, so assign it back
df1

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,income_cat,education_cat,workclass_cat,marital-status_cat,occupation_cat,relationship_cat,race_cat,sex_cat,native-country_cat,workclass_cat.1,income_cat.1,marital-status_cat.1,occupation_cat.1,relationship_cat.1,race_cat.1,sex_cat.1,native-country_cat.1,education_cat.1
0,39,77516,13,2174,0,40,0,9,7,4,1,1,4,1,39,7,0,4,1,1,4,1,39,9
1,50,83311,13,0,0,13,0,9,6,2,4,0,4,1,39,6,0,2,4,0,4,1,39,9
2,38,215646,9,0,0,40,0,11,4,0,6,1,4,1,39,4,0,0,6,1,4,1,39,11
3,53,234721,7,0,0,40,0,1,4,2,6,0,2,1,39,4,0,2,6,0,2,1,39,1
4,28,338409,13,0,0,40,0,9,4,2,10,5,2,0,5,4,0,2,10,5,2,0,5,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,257302,12,0,0,38,0,7,4,2,13,5,4,0,39,4,0,2,13,5,4,0,39,7
32557,40,154374,9,0,0,40,1,11,4,2,7,0,4,1,39,4,1,2,7,0,4,1,39,11
32558,58,151910,9,0,0,40,0,11,4,6,1,4,4,0,39,4,0,6,1,4,4,0,39,11
32559,22,201490,9,0,0,20,0,11,4,4,1,3,4,1,39,4,0,4,1,3,4,1,39,11


In [95]:
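# Labels must be passed as one list; errors="ignore" skips any column that was already dropped.
df1 = df1.drop(columns=[" sex", " native-country", " education"], errors="ignore")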

In [89]:
# Use only the numeric and encoded feature columns; the target must not be part of X.
feature_cols = ['age', 'workclass_cat', ' fnlwgt', 'education_cat', ' education-num', 'marital-status_cat', 'occupation_cat', 'relationship_cat', 'race_cat', 'sex_cat', ' capital-gain', ' capital-loss', ' hours-per-week', 'native-country_cat']
X = df1[feature_cols].values
y = df1['income_cat'].values

In [90]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
lr = LogisticRegression().fit(X_train, y_train)
yhat = lr.predict(X_train)  # predictions on the training split

# Put the training features next to the actual and predicted labels.
dftrain = pd.DataFrame(X_train, columns=feature_cols)
dftrain['Actual'] = y_train
dftrain['Predicted'] = yhat
dftrain.head()

In [87]:
# pd.to_numeric returns a new Series, so assign each result back to df1.
numeric_cols = ['age', 'workclass_cat', ' fnlwgt', 'education_cat', ' education-num', 'marital-status_cat', 'occupation_cat', 'relationship_cat', 'race_cat', 'sex_cat', ' capital-gain', ' capital-loss', ' hours-per-week', 'native-country_cat']
for col in numeric_cols:
  df1[col] = pd.to_numeric(df1[col])

# Display the last converted column as a check.
df1['native-country_cat']


0        39
1        39
2        39
3        39
4         5
         ..
32556    39
32557    39
32558    39
32559    39
32560    39
Name: native-country_cat, Length: 32561, dtype: int8
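
The brief at the top of this notebook also asks for a clustering algorithm. Clustering is unsupervised, so one way to turn it into a classifier is to fit KMeans on the training features and label each cluster with the majority class of its training members. The sketch below does this on synthetic blobs; the data, `n_clusters=2`, and the scaling step are assumptions for illustration, not results for the income data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for real features (assumption).
X, y = make_blobs(n_samples=600, centers=2, cluster_std=2.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# KMeans is distance-based, so put features on a comparable scale first.
scaler = StandardScaler().fit(X_train)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaler.transform(X_train))

# Map each cluster id to the most common training label inside that cluster.
cluster_to_label = {}
for c in range(kmeans.n_clusters):
    members = y_train[kmeans.labels_ == c]
    cluster_to_label[c] = int(np.bincount(members).argmax()) if len(members) else 0

# Classify test rows by the label of their nearest cluster.
test_clusters = kmeans.predict(scaler.transform(X_test))
y_pred = np.array([cluster_to_label[c] for c in test_clusters])
accuracy = accuracy_score(y_test, y_pred)
print(f"KMeans-as-classifier accuracy={accuracy:.3f}")
```

For an ROC curve, a per-sample score is needed; the negative distance to the positive-class centroid (from `kmeans.transform`) is one candidate.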