# PYTHON FOR DATA SCIENCE TUTORIAL NOTEBOOK

This tutorial serves as a crash course for people who want to quickly <br>
start doing analysis using the Python ecosystem. As such, only the most <br>
important concepts are covered.<br>
<br>
This notebook serves as a starting point and playground for further experimentation. 

## CONTENT
<ul>
    <li>Python</li>
    <li>Building a Simple Model Using Pandas and Scikit-learn</li>
</ul>

## USING THE NOTEBOOK
- A notebook is made up of cells
- Run a cell by pressing SHIFT + ENTER
- A notebook is stateful, so be careful with variables created in previous cells

## PYTHON

Python is a high-level programming language. Others include Java and JavaScript.<br>
Each language has its strengths and weaknesses for building different kinds<br>
of systems.<br>
<br>
Python is widely regarded as the de facto language for Data Science and Engineering, mainly <br>
because it is relatively easy to learn and has a large ecosystem of libraries<br>
like Pandas and Scikit-learn, which make working with data much easier and faster.<br>

#### PROGRAM
A computer program consists of Data structures and Algorithms.<br>
The data structures hold the data in memory while the algorithm<br>
is the control/logic/instruction-set which the computer uses to manipulate<br>
the data to get a desired result.<br>

We will learn the basics of writing Python code by working through two simple problems.

#### PROBLEM
1. List all the odd and even numbers between 1 and 20
2. What is the frequency of each letter in the word `ENGINEERING`?

In [1]:
# QUESTION 1
# Solution step 1: Put the numbers 1 to 20 into a data structure.
# We use a popular one called a LIST

numbers = list(range(1,21))

odd_numbers = []  # Empty lists that will hold the results after manipulation
even_numbers = []

print(numbers)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]


In [2]:
# Solution step 2: Loop through the list and sort the numbers into odd and even
# using a CONDITIONAL (IF) STATEMENT

for i in numbers:
    if i%2 == 0:
        even_numbers.append(i)
    else:
        odd_numbers.append(i)

print(f"The odd numbers are : {odd_numbers}")
print(f"The even numbers are : {even_numbers}")

The odd numbers are : [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
The even numbers are : [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
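The loop above can also be written more compactly. The sketch below is an alternative, not a replacement for the cells above: it uses a list comprehension, which builds a new list from the items that pass a condition.

```python
# Same data as above: the integers 1 to 20 inclusive
numbers = list(range(1, 21))

# Each comprehension keeps only the numbers that satisfy its condition
even_numbers = [n for n in numbers if n % 2 == 0]
odd_numbers = [n for n in numbers if n % 2 != 0]

print(f"The odd numbers are : {odd_numbers}")
print(f"The even numbers are : {even_numbers}")
```

Because each list is rebuilt from scratch, running this cell twice gives the same result. The `append` version keeps adding to the existing lists if its cell is re-run without first re-running the cell that creates them.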


In [3]:
# QUESTION 2
# We make our solution reusable by creating a FUNCTION

def letter_counter(word):
    
    result_dict = {}  # Another type of data structure, called a DICTIONARY, made of key-value pairs
    
    for letter in word:
        if letter in result_dict:
            result_dict[letter] = result_dict[letter] + 1
        else:
            result_dict[letter] = 1
    return result_dict

In [4]:
# Now, we can call our function with the word 'ENGINEERING'
letter_counter("ENGINEERING")

{'E': 3, 'N': 3, 'G': 2, 'I': 2, 'R': 1}

In [5]:
# Or any other word 
letter_counter("ARCHITECTURE")

{'A': 1, 'R': 2, 'C': 2, 'H': 1, 'I': 1, 'T': 2, 'E': 2, 'U': 1}
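Counting items is common enough that Python's standard library already covers it. As a point of comparison with `letter_counter`, the sketch below uses `collections.Counter`, which behaves like a dictionary of counts.

```python
from collections import Counter

# Counter walks through the string and tallies each letter
letter_counts = Counter("ENGINEERING")

print(dict(letter_counts))
print(letter_counts["E"])  # look up the count of a single letter
print(letter_counts["Z"])  # a letter that never appears has a count of 0
```

Writing the function by hand is still worthwhile when learning, since it shows the loop and the dictionary updates that `Counter` does for you.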

In [6]:
# SCRATCH PAD

## BUILDING A SIMPLE MODEL USING PANDAS AND SCIKIT-LEARN

With your knowledge of Python, you should be able to solve many problems.<br>
There are, however, many libraries that help you perform tasks faster than with vanilla Python.<br>

Pandas is a very useful library for manipulating datasets.
Scikit-learn is a very powerful library for creating machine learning models.

Using these libraries, we will take a dataset with labels and train a model on it (Supervised Learning).

In [7]:
import pandas as pd

In [8]:
# Read in the data from the source. In this case, a CSV file
import pandas as pd

dataset = pd.read_csv("adult_v1.csv")
dataset.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income_band
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


#### STEP 1:  MANIPULATION AND PREPROCESSING

In [9]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      48842 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48842 non-null  object
 14  income_band     48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [10]:
# Get Descriptive stats for numerical columns of the dataset
dataset.describe()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0
mean,38.643585,189664.1,10.078089,1079.067626,87.502314,40.422382
std,13.71051,105604.0,2.570973,7452.019058,403.004552,12.391444
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117550.5,9.0,0.0,0.0,40.0
50%,37.0,178144.5,10.0,0.0,0.0,40.0
75%,48.0,237642.0,12.0,0.0,0.0,45.0
max,90.0,1490400.0,16.0,99999.0,4356.0,99.0


In [11]:
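# Encode the labels as 0 and 1 to make them easier to work with
# Data Manipulation 1: MUTATION (adding a new column derived from an existing one)


def encode_income_band(x):
    # Exact match: only the string ">50K" is encoded as 1; every other value becomes 0
    if x == ">50K":
        return 1
    else:
        return 0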
    
dataset['income_band_label'] = dataset['income_band'].apply(encode_income_band)
dataset.sample(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income_band,income_band_label
37150,67,Self-emp-inc,22313,Some-college,10,Married-civ-spouse,Sales,Husband,White,Male,20051,0,40,United-States,>50K.,0
40456,75,?,34235,HS-grad,9,Widowed,?,Not-in-family,White,Female,2964,0,14,United-States,<=50K.,0
47763,56,Private,217775,HS-grad,9,Divorced,Transport-moving,Not-in-family,White,Male,0,0,50,United-States,>50K.,0
37891,23,Private,189203,Assoc-voc,11,Never-married,Adm-clerical,Not-in-family,White,Male,0,0,50,United-States,<=50K.,0
44030,65,State-gov,172348,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,20,United-States,<=50K.,0
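In the sample output above, some rows have an `income_band` of `>50K.` (with a trailing period) and still receive the label 0, because `encode_income_band` compares against the exact string `>50K`. If both spellings are meant to describe the same income band (an assumption about this dataset, not something the notebook states), the text can be normalised before comparing. The function name below is hypothetical.

```python
def encode_income_band_normalized(raw_label):
    """Return 1 for the '>50K' band and 0 otherwise.

    Assumption: surrounding spaces and a trailing period carry no meaning,
    so '>50K' and '>50K.' are treated as the same band.
    """
    cleaned = raw_label.strip().rstrip(".")
    return 1 if cleaned == ">50K" else 0


# Try the encoder on the spellings seen in the outputs above
for raw in [">50K", ">50K.", "<=50K", "<=50K."]:
    print(f"{raw!r} -> {encode_income_band_normalized(raw)}")
```

It can be passed to `dataset['income_band'].apply(...)` in the same way as `encode_income_band`. Doing so would change the labels, so the outputs shown in the rest of this notebook would no longer match exactly.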


#### SUPERVISED LEARNING DATA PREPARATION

For Supervised Learning, you have to split the dataset into<br>
FEATURES and the LABEL as shown below.

![image.png](attachment:image.png)
[Link to original image](https://jakevdp.github.io/PythonDataScienceHandbook/06.00-figure-code.html#Features-and-Labels-Grid)

In [12]:
# Select columns for modeling: FEATURES and LABELS
# Data Manipulation 2: SELECT (Column extraction)

y = dataset['income_band_label'] # This selects all the rows in the column given
print('Sample of Y values')
print(y.sample(3))


features = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss',
           'hours-per-week', 'sex']

X = dataset.loc[:, features]  # This selects all the rows and only the columns in the list 'features'
print()
print('Sample of X values')
print(X.sample(3))

Sample of Y values
5582     1
21652    1
23842    0
Name: income_band_label, dtype: int64

Sample of X values
       age  fnlwgt  education-num  capital-gain  capital-loss  hours-per-week  \
38476   25  214468             13             0             0              40   
4746    28   78870             13          8614             0              40   
23195   25  257064             10             0             0              38   

          sex  
38476  Female  
4746     Male  
23195    Male  


In [13]:
# Quick method to convert categorical variables to numerical form
# In this case, the column sex is converted to sex_Female and sex_Male automatically

X = pd.get_dummies(X)
X.head(5)

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,sex_Female,sex_Male
0,39,77516,13,2174,0,40,0,1
1,50,83311,13,0,0,13,0,1
2,38,215646,9,0,0,40,0,1
3,53,234721,7,0,0,40,0,1
4,28,338409,13,0,0,40,1,0
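To see what this conversion does, here is a minimal pure-Python sketch of the same idea (often called one-hot encoding). It illustrates the concept only and is not how pandas implements `get_dummies`; the helper name `one_hot` is hypothetical.

```python
def one_hot(column_name, values):
    """Build one 0/1 column per distinct category found in values."""
    categories = sorted(set(values))  # sorted so the column order is predictable
    return {
        f"{column_name}_{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }


# The 'sex' values of the first five rows shown above
sex = ["Male", "Male", "Male", "Male", "Female"]

encoded = one_hot("sex", sex)
for name, column in encoded.items():
    print(name, column)
```

Each row has exactly one 1 across the new columns, which is how a text category becomes numbers a model can use.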


In [14]:
# Next, we import the train_test_split function from the scikit-learn library
from sklearn.model_selection import train_test_split

#### WHY DO WE NEED TO SPLIT THE DATASET?

![overfitting_underfitting](over_under.png) <br>
[Link to original image](https://www.educative.io/edpresso/overfitting-and-underfitting)

In [15]:
# Split the dataset in two: one part for training the model,
# the other for testing that the model works well enough (evaluation).
# test_size=0.2 holds out 20% of the rows for testing

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=333)

In [16]:
# Check the sizes of the produced datasets. It shows the number of rows and columns
print(f"Train features shape: {X_train.shape}")
print(f"Test features shape: {X_test.shape}")
print(f"Train label shape: {y_train.shape}")
print(f"Test label shape: {y_test.shape}")

Train features shape: (39073, 8)
Test features shape: (9769, 8)
Train label shape: (39073,)
Test label shape: (9769,)
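The shapes above follow from simple arithmetic on the 48842 rows. The sketch below is a rough pure-Python illustration of a shuffled 80/20 split. It assumes the test count is rounded up, which matches the shapes printed above. It uses Python's own `random` module, so it will not pick the same rows as `train_test_split`, and the helper name `split_indices` is hypothetical.

```python
import math
import random


def split_indices(n_rows, test_size, seed):
    """Shuffle the row positions, then cut them into a train part and a test part."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the shuffle repeatable
    n_test = math.ceil(n_rows * test_size)  # assumption: round the test count up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(48842, 0.2, seed=333)
print(f"Train rows: {len(train_idx)}")
print(f"Test rows: {len(test_idx)}")
```

Every row ends up in exactly one of the two parts, so the model is never evaluated on a row it was trained on.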


#### STEP 2: MODELLING

In [17]:
# Here we select an algorithm to use to build the model
# We will use a simple one called Logistic Regression

from sklearn.linear_model import LogisticRegression

In [18]:
# Instantiate the classifier
clf = LogisticRegression()

In [19]:
# Train your classifier on your data
# It is fed the FEATURES and corresponding LABELS so that it learns a mapping function
# between them
clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [20]:
# Use your classifier model to make predictions on your Test Features
model_predictions = clf.predict(X_test)
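For intuition about the "mapping function" a logistic regression learns, the sketch below shows how a single prediction is formed for a two-class problem: a weighted sum of the features is passed through the sigmoid function to get a probability, which is then compared with 0.5. The weights, intercept, and feature values are made up for illustration and are not the ones learned by `clf`.

```python
import math


def sigmoid(z):
    """Squash any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, intercept):
    """Return (label, probability) for one row of features."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    label = 1 if probability > 0.5 else 0
    return label, probability


# Hypothetical model with two features
weights = [0.5, -0.25]
intercept = -1.0

label_a, prob_a = predict_label([4.0, 2.0], weights, intercept)  # z = 0.5
label_b, prob_b = predict_label([1.0, 2.0], weights, intercept)  # z = -1.0
print(label_a, round(prob_a, 4))
print(label_b, round(prob_b, 4))
```

Training (`fit`) is the process of finding weights and an intercept that make these probabilities agree with the known labels as closely as possible.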

#### STEP 3: EVALUATION

In [21]:
from sklearn.metrics import accuracy_score

In [22]:
# We will evaluate our model by checking the accuracy of the predictions
# when compared with the actual Test labels

accuracy = accuracy_score(y_test, model_predictions)
print(f"{accuracy}")

0.8356024158050978
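Accuracy is the fraction of predictions that match the actual labels, so the value above means roughly 83.6% of the test rows were predicted correctly. The sketch below computes the same quantity by hand on made-up labels. The function name `fraction_correct` is hypothetical.

```python
def fraction_correct(actual, predicted):
    """Return the share of positions where the prediction equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


# Made-up labels for illustration: 4 of the 5 predictions are right
actual_labels = [0, 1, 0, 0, 1]
predicted_labels = [0, 0, 0, 0, 1]

print(fraction_correct(actual_labels, predicted_labels))
```

Accuracy alone does not say which kind of mistake the model makes. Here the only error is a 1 that was predicted as 0.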


### DISCUSSION: 
HOW CAN WE INCREASE THE ACCURACY OF THE MODEL?