<a href="https://colab.research.google.com/github/kundyyy/100-Days-Of-ML-Code/blob/master/AfterWork_Data_Science_Classification_Analysis_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color="blue">To use this notebook on Google Colaboratory, you will need to make a copy of it. Go to **File** > **Save a Copy in Drive**. You can then use the new copy that will appear in the new tab.</font>

# AfterWork Data Science: Classification Analysis with Python

## Importing the Necessary Libraries

In [0]:
# We will start by running this cell which will import the necessary libraries
# ---
# 
import pandas as pd                # Pandas for data manipulation
import numpy as np                 # Numpy for scientific computations
import matplotlib.pyplot as plt    # Matplotlib for visualisation - we might not use it, but it is here just in case you decide to
# The magic command below displays our visualisations inside the notebook
%matplotlib inline

## Example 

In [0]:
# Example 
# ---
# Question: Will John, who is 40 years old and has a salary of 2500, buy a car?
# ---
# Dataset url = http://bit.ly/SocialNetworkAdsDataset
# ---

#### Data Importation and Exploration

In [0]:
# Loading and previewing our dataset
# ---
# 
social_df = pd.read_csv('http://bit.ly/SocialNetworkAdsDataset')
social_df.head()

In [0]:
# Determining the size of our dataset
# (records, columns)
# ---
# 
social_df.shape

#### Data Preparation

In [0]:
# Normally during this stage we would perform quite a number of 
# procedures, but because our focus is only on learning about the 
# different modeling algorithms, we will only perform one 
# essential step on our dataset. We will perform encoding,
# which will help us transform the categorical values in our 
# dataset into numerical values. 
# Let's see what happens when we encode the Gender variable 
# to have only numerical values (Male = 1, Female = 0). 
# ---
#
social_df["Gender"] = np.where(social_df["Gender"] == "Male", 1, 0)
social_df.head()
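
The encoding step above can also be written as an explicit mapping. The sketch below is only an illustration: `toy_df`, `gender_codes` and the `Gender_encoded` column are hypothetical names, and the four rows are made up. One possible advantage of a mapping is that any unexpected category becomes `NaN` instead of silently turning into 0.

```python
import pandas as pd

# Hypothetical stand-in for the Gender column of our dataset
toy_df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Explicit mapping: a value that is not in the dictionary becomes NaN,
# which makes unexpected categories easy to spot
gender_codes = {"Male": 1, "Female": 0}
toy_df["Gender_encoded"] = toy_df["Gender"].map(gender_codes)

print(toy_df)
encoded_values = toy_df["Gender_encoded"].tolist()
```
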

#### Data Modeling

In [0]:
# Preparing our dataset for training
# ---
# We first divide our data into attributes and labels.
# You can think of this as splitting our dataset into independent and dependent variables,
# where Gender, Age and EstimatedSalary are the independent variables and Purchased is the dependent/label variable.
# ---
# 
X = social_df.iloc[:, [1, 2 ,3]].values  # Independent/predictor variables
y = social_df.iloc[:, 4].values          # Dependent/label variable
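
Selecting columns by position works, but it depends on the column order never changing. A minimal sketch of selecting by name instead is shown below. It assumes the columns are called `Gender`, `Age`, `EstimatedSalary` and `Purchased`; the `User ID` column and all the values in `toy_df` are hypothetical.

```python
import pandas as pd

# Hypothetical rows that mimic the layout of our dataset
toy_df = pd.DataFrame({
    "User ID": [101, 102, 103],
    "Gender": [1, 0, 0],
    "Age": [19, 35, 47],
    "EstimatedSalary": [19000, 20000, 25000],
    "Purchased": [0, 0, 1],
})

feature_columns = ["Gender", "Age", "EstimatedSalary"]

X_named = toy_df[feature_columns].values   # Independent/predictor variables
y_named = toy_df["Purchased"].values       # Dependent/label variable

print(X_named.shape, y_named.shape)
```
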

In [0]:
# Splitting the dataset into a training set and test set
# ---
# We will split our dataset into training data and test data. 
# Training data will be used to train our classifiers and test data will be used to validate them.
# Because we'll use sklearn to split our data, we will import train_test_split from sklearn.model_selection
# ---
# 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
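
To see what the split does, here is a small sketch on made-up data. `X_toy` and `y_toy` are hypothetical, and the `stratify` argument is an optional extra that the cell above does not use; it asks sklearn to keep the proportion of 0's and 1's roughly the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 records, 2 features, 15 zeros and 5 ones
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 15 + [1] * 5)

# test_size=0.25 sends a quarter of the records to the test set;
# stratify=y_toy keeps the 0/1 ratio similar in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

print("Train shape:", X_tr.shape, "Test shape:", X_te.shape)
print("Share of 1's in train:", y_tr.mean(), "and in test:", y_te.mean())
```
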

In [0]:
# Feature Scaling / Standardisation
# ---
# We then perform feature scaling so that every feature has a mean of 0 and a standard deviation of 1.
# Here, scaling is important because there is a huge difference between the ranges of Age and EstimatedSalary.
# Classifiers that rely on distances, such as KNN and SVM, would otherwise be dominated by EstimatedSalary.
# ---
# 

# We import our scaler from sklearn
from sklearn.preprocessing import StandardScaler

# We make an instance sc_X of the class StandardScaler.
# You can think of an instance as our own working copy of the scaler.
sc_X = StandardScaler()

# We fit the scaler on X_train only, then apply the same transformation to X_test
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
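
The sketch below illustrates what StandardScaler does, using made-up Age and EstimatedSalary values (the arrays `train` and `test` are hypothetical). Each training column ends up with a mean of 0 and a standard deviation of 1, and the test row is scaled with the mean and standard deviation learned from the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical Age and EstimatedSalary values, chosen only for illustration
train = np.array([[20.0, 20000.0], [40.0, 60000.0], [60.0, 100000.0]])
test = np.array([[30.0, 40000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

print("Column means after scaling:", train_scaled.mean(axis=0))
print("Column stds after scaling: ", train_scaled.std(axis=0))
print("Scaled test row:", test_scaled)
```
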

In [0]:
# In this example, because we will be comparing how 
# the different classification algorithms will perform, 
# we import our classifiers as shown below.
# ---
#
from sklearn.linear_model import LogisticRegression # Logistic Regression Classifier
from sklearn.tree import DecisionTreeClassifier     # Decision Tree Classifier
from sklearn.svm import SVC                         # SVM Classifier
from sklearn.naive_bayes import GaussianNB          # Naive Bayes Classifier
from sklearn.neighbors import KNeighborsClassifier  # KNN Classifier

# Below, we make an instance of each of the classes LogisticRegression, 
# DecisionTreeClassifier, SVC, KNeighborsClassifier and GaussianNB.
# As we will get to see, each of the classifiers takes different parameters.
# ---
# 
logistic_classifier = LogisticRegression(random_state = 0, solver='lbfgs')
decision_classifier = DecisionTreeClassifier()
svm_classifier = SVC()
knn_classifier = KNeighborsClassifier(n_neighbors=5)
naive_classifier = GaussianNB()

# We now fit these classifiers to our data, X_train and y_train.
# By fitting we mean that we train our classifiers on the training dataset.
# ---
# Upon running this cell, we should have classifiers that can predict 
# whether a person will buy a car or not.
# ---
# Don't worry about the output, we get GaussianNB because our Naive Bayes classifier
# is the last one to be built.
# ---
#
logistic_classifier.fit(X_train, y_train)
decision_classifier.fit(X_train, y_train)
svm_classifier.fit(X_train, y_train)
knn_classifier.fit(X_train, y_train)
naive_classifier.fit(X_train, y_train)
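
When several classifiers are trained in the same way, one option is to keep them in a dictionary and loop over it. The sketch below is a hedged illustration on synthetic data from `make_classification`, not on our dataset; the names `classifiers` and `scores` are hypothetical, and `random_state=0` on the decision tree is an added assumption to make its result repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Social Network Ads dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

classifiers = {
    "Logistic Regression": LogisticRegression(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)                    # train on the training data
    scores[name] = clf.score(X_te, y_te)   # mean accuracy on the held-out data
    print(f"{name}: {scores[name]:.2f}")
```
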

In [0]:
# We now predict the test set results. 
# This will help us determine whether our classifiers made the correct predictions.
# ---
# No expected output here.
# ---
logistic_y_prediction = logistic_classifier.predict(X_test) 
decision_y_prediction = decision_classifier.predict(X_test) 
svm_y_prediction = svm_classifier.predict(X_test) 
knn_y_prediction = knn_classifier.predict(X_test) 
naive_y_prediction = naive_classifier.predict(X_test) 

In [0]:
# We then import evaluation metrics to determine the accuracy of classifiers
# ---
# 
from sklearn.metrics import classification_report, accuracy_score 

# The accuracy score is the simplest way to evaluate a classifier.
# However, it can be misleading for a highly imbalanced dataset. 
# By imbalanced we mean that our dataset does not have
# roughly equal numbers of 1's and 0's.
# ---
print(accuracy_score(y_test, logistic_y_prediction))
print(accuracy_score(y_test, decision_y_prediction))
print(accuracy_score(y_test, svm_y_prediction))
print(accuracy_score(y_test, knn_y_prediction))
print(accuracy_score(y_test, naive_y_prediction))

# From the accuracy scores we get 90%, 90%, 93%, 93% & 91% respectively.
# The most accurate classifiers are SVM & KNN. 
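
To see why accuracy alone can mislead on imbalanced data, consider the small sketch below. The labels are made up (95 zeros and 5 ones), and the "classifier" simply predicts 0 for everyone. It scores a high accuracy while never finding a single 1.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical test labels: 95 negatives and 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A "classifier" that always predicts the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)   # recall for class 1

print("Accuracy:", acc)
print("Recall for class 1:", rec)
```
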

In [0]:
# We now print the classification report, 
# which is more informative than accuracy alone for an imbalanced dataset. 
# It reports precision, recall, f1-score and support for each class.
# 
# ---
# The precision answers "of the records predicted as this class, how many truly belong to it?".
# The recall answers "of the records that truly belong to this class, how many did we find?".
# The f1-score is the harmonic mean of precision & recall.
# The support is the number of occurrences of the given class in the test set.
# ---
# 
print('Logistic classifier:')
print(classification_report(y_test, logistic_y_prediction))

print('Decision Tree classifier:')
print(classification_report(y_test, decision_y_prediction))

print('SVM Classifier:')
print(classification_report(y_test, svm_y_prediction))

print('KNN Classifier:')
print(classification_report(y_test, knn_y_prediction))

print('Naive Bayes Classifier:')
print(classification_report(y_test, naive_y_prediction)) 

# From our classification report,
# our support tells us that our test set is imbalanced, i.e. 63 0's and 32 1's.
# From the weighted avg, which takes our imbalanced
# dataset into account, we get 90%, 90%, 93%, 93%, 91%.
# The most accurate classifiers are still SVM and KNN. 
# We can then apply further model optimization techniques, e.g. 
# data cleaning, feature engineering, checking for model assumptions, etc.,
# to get the best classifier. 
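
The figures in a classification report can be worked out by hand. The sketch below does so for class 1 on ten made-up labels (`y_true` and `y_pred` are hypothetical), counting true positives, false positives and false negatives and then applying the definitions of precision, recall and f1-score.

```python
# Hypothetical true labels and predictions for ten records
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted 1, truly 1
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted 1, truly 0
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted 0, truly 1

precision = tp / (tp + fp)   # of the predicted 1's, how many are truly 1
recall = tp / (tp + fn)      # of the true 1's, how many did we find
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
support = sum(y_true)        # number of true 1's

print("precision:", precision)         # 0.6
print("recall:", recall)               # 0.75
print("f1-score:", round(f1, 4))       # 0.6667
print("support:", support)             # 4
```
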

In [0]:
# Answering our question
# ---
# We then make a new prediction & compare results.
# Note that we would only use the best optimized classifier for this case.
# ---
# Predict whether John, 40 years old with a salary of 2500, will buy a car or not.
# ---
# Dataset limitation: This is not a practical dataset, so it lacks essential features/variables.
# In a real case scenario, we would work with many kinds of datasets that require transformation,
# e.g. data cleaning, feature engineering, etc.
# ---
# Our classifiers were trained on scaled data, so we scale the new case
# with the same scaler before predicting.
# The columns are in the same order as X: Gender (1 = Male), Age, EstimatedSalary.
#
new_case = sc_X.transform([[1, 40, 2500]])

print(logistic_classifier.predict(new_case))
print(decision_classifier.predict(new_case))
print("Best Classifier SVM", svm_classifier.predict(new_case))
print("Best Classifier KNN", knn_classifier.predict(new_case))
print(naive_classifier.predict(new_case))
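
One way to avoid forgetting to scale a new case is to bundle the scaler and the classifier into a single sklearn pipeline. The sketch below is an illustration under stated assumptions: the six training rows, the new case and the names `X_raw`, `y_raw` and `model` are all made up, and the SVM is used only as an example classifier.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical raw records: Gender (1 = Male), Age, EstimatedSalary
X_raw = np.array([
    [1, 20, 20000],
    [0, 25, 30000],
    [1, 30, 25000],
    [0, 50, 90000],
    [1, 55, 100000],
    [0, 60, 120000],
])
y_raw = np.array([0, 0, 0, 1, 1, 1])   # hypothetical Purchased labels

# The pipeline scales the data and then passes it to the classifier,
# both when fitting and when predicting
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_raw, y_raw)

# The new case is given in raw units; the pipeline scales it for us
new_person = [[1, 58, 110000]]
prediction = model.predict(new_person)
print("Prediction:", prediction)
```
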

## <font color="green">Challenges</font>

### <font color="green">Challenge 1</font>

In [0]:
# Challenge 1
# ---
# Question: As a researcher at KEMRI, you are performing research on diabetes.
# Create a classifier to determine whether a person has diabetes or not
# from the following sample dataset.
# ---
# Dataset url = http://bit.ly/ADiabetesDataset
# ---
# OUR CODE GOES BELOW
#

### <font color="green">Challenge 2</font>

In [0]:
# Challenge 2
# ---
# Question: A cancer medical research institution would like to make predictions on two different 
# cancer types, benign and malignant. Build a model to predict the breast cancer type 
# (0 = benign or 1 = malignant) given the following dataset. In addition, make a prediction.
# NB: Remember to record your observations.
# ---
# Dataset url = http://bit.ly/BreastCancersDataset
# ---
# OUR CODE GOES BELOW
#

### <font color="green">Challenge 3</font>

In [0]:
# Challenge 3
# ---
# Question: Build a classifier to predict car sales and check the accuracy of the prediction,
# given the following dataset.
# ---
# Dataset url = https://bit.ly/3dvU2BB
# ---
# OUR CODE GOES BELOW
#