# warmup

In [1]:
import pandas as pd
import numpy as np

In [5]:
loans_df = pd.read_csv('loans_full_schema.csv')
loans_df.head(50)

Unnamed: 0,emp_title,emp_length,state,homeownership,annual_income,verified_income,debt_to_income,annual_income_joint,verification_income_joint,debt_to_income_joint,...,sub_grade,issue_month,loan_status,initial_listing_status,disbursement_method,balance,paid_total,paid_principal,paid_interest,paid_late_fees
0,global config engineer,3.0,NJ,MORTGAGE,90000.0,Verified,18.01,,,,...,C3,Mar-2018,Current,whole,Cash,27015.86,1999.33,984.14,1015.19,0.0
1,warehouse office clerk,10.0,HI,RENT,40000.0,Not Verified,5.04,,,,...,C1,Feb-2018,Current,whole,Cash,4651.37,499.12,348.63,150.49,0.0
2,assembly,3.0,WI,RENT,40000.0,Source Verified,21.15,,,,...,D1,Feb-2018,Current,fractional,Cash,1824.63,281.8,175.37,106.43,0.0
3,customer service,1.0,PA,RENT,30000.0,Not Verified,10.16,,,,...,A3,Jan-2018,Current,whole,Cash,18853.26,3312.89,2746.74,566.15,0.0
4,security supervisor,10.0,CA,RENT,35000.0,Verified,57.96,57000.0,Verified,37.66,...,C3,Mar-2018,Current,whole,Cash,21430.15,2324.65,1569.85,754.8,0.0
5,,,KY,OWN,34000.0,Not Verified,6.46,,,,...,A3,Jan-2018,Current,whole,Cash,4256.71,873.13,743.29,129.84,0.0
6,hr,10.0,MI,MORTGAGE,35000.0,Source Verified,23.66,155000.0,Not Verified,13.12,...,C2,Jan-2018,Current,whole,Cash,22560.0,2730.51,1440.0,1290.51,0.0
7,police,10.0,AZ,MORTGAGE,110000.0,Source Verified,16.19,,,,...,B5,Jan-2018,Current,whole,Cash,19005.39,1765.84,994.61,771.23,0.0
8,parts,10.0,NV,MORTGAGE,65000.0,Source Verified,36.48,,,,...,C2,Feb-2018,Current,whole,Cash,18156.66,2703.22,1843.34,859.88,0.0
9,4th person,3.0,IL,RENT,30000.0,Not Verified,18.91,,,,...,A3,Mar-2018,Current,fractional,Cash,6077.13,391.15,322.87,68.28,0.0


# Notes

Probability measures how likely an event is to occur, expressed as a number between 0 and 1.
<br>
A random variable is a variable whose value is determined by the outcome of a random phenomenon. Each possible outcome has a probability between 0 and 1, and the probabilities of all outcomes sum to 1.
<br>
Random seeds make results repeatable: the same seed fed to the same generator produces the same sequence of numbers, across runs and across machines.
<br>
Computers are not truly random; a seed initializes a deterministic algorithm that produces pseudorandom numbers.
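
The seed behaviour described above can be checked with a minimal sketch. It assumes NumPy's `default_rng` generator; the seed value 42 and the die-roll range are arbitrary choices for illustration, not anything from the course material.

```python
import numpy as np

# Two generators built from the same seed (42 is an arbitrary choice)
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

# Five simulated die rolls from each generator (integers 1 to 6)
draws_a = rng_a.integers(1, 7, size=5)
draws_b = rng_b.integers(1, 7, size=5)

# The same seed gives the same pseudorandom sequence
same_seed_matches = bool(np.array_equal(draws_a, draws_b))
print(draws_a, draws_b, same_seed_matches)
```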

Discrete vs continuous
<br>
Discrete: a countable set of specific outcomes (such as integers). Continuous: infinitely many possible outcomes within a range (any decimal value in the interval).

Uniform distribution: All outcomes have the same probability

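Binomial distribution: discrete; the number of successes in n independent trials, each with success probability p. It is symmetric only when p = 0.5 and looks roughly bell-shaped when n is large.

Normal/Gaussian distribution: continuous, symmetric bell curve. Values near the mean (the peak in the middle) are most likely; values in the tails (the low, flat ends) are least likely.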

Expected value: the long-run average of a random variable, i.e. the probability-weighted mean of its possible outcomes.
<br>
Law of large numbers: as the sample size increases, the sample mean gets closer to the population mean (the expected value).
<br>
Uniform distribution: the expected value is (a+b)/2, where a is the minimum possible value and b is the maximum
<br>
Binomial distribution: the expected value is the mean, which equals n * p (n trials, each with success probability p)
<br>
Normal distribution: the expected value is the mean μ (which is 0 for the standard normal distribution)
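
To see the law of large numbers and the uniform formula together, here is a small simulation sketch. The bounds `a = 2` and `b = 10`, the seed, and the sample sizes are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability
a, b = 2.0, 10.0                # assumed bounds: a = minimum, b = maximum
expected = (a + b) / 2          # theoretical expected value of Uniform(a, b)

# The sample mean should drift toward the expected value as n grows
for n in (10, 1_000, 100_000):
    sample_mean = rng.uniform(a, b, size=n).mean()
    print(n, round(float(sample_mean), 3), "expected:", expected)

# One large sample, kept for comparison against the formula
large_sample_mean = float(rng.uniform(a, b, size=1_000_000).mean())
```

With small n the sample mean can be noticeably off; with a million draws it should sit very close to (a+b)/2.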

Q1: What is the point of setting a random seed in probability?
<br>
A1: It lets you reproduce the same results across different runs and machines, since computers are not truly random and rely on seeds to generate pseudorandom numbers.
<br>
Q2: What is the expected value for the number of heads in 50 trials of coin tosses?
<br>
A2: 25 (assuming a fair coin: n * p = 50 * 0.5)
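
A quick check of A2, as a sketch: it assumes a fair coin (p = 0.5) and uses NumPy's binomial sampler; the seed and the number of simulated runs are arbitrary.

```python
import numpy as np

n_trials, p_heads = 50, 0.5          # assumes a fair coin
expected_heads = n_trials * p_heads  # binomial mean: n * p

rng = np.random.default_rng(1)       # arbitrary seed
# Each entry is the number of heads in one run of 50 tosses
heads_per_run = rng.binomial(n_trials, p_heads, size=100_000)
simulated_mean = float(heads_per_run.mean())

print(expected_heads, simulated_mean)
```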

### What is machine learning?

Machine learning is the study of computer algorithms that improve automatically through experience and the use of data. It uses statistics to find patterns in large amounts of data. The core idea is that a computer program can learn from and adapt to new data without human intervention.

## Preprocessing
1. clean up null values
2. data cleaning (dashes, odd characters, missing values, outliers)
3. one-hot encoding
4. convert categorical values to numerical
5. standardization
6. deal with multicollinearity (predictor variables that are highly correlated with each other)
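
A few of the steps above can be sketched on a toy frame. The data here is hypothetical (loosely modeled on the loans columns shown earlier), and filling nulls with the median is only one possible choice.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real loans file
toy_df = pd.DataFrame({
    "homeownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
    "annual_income": [40000.0, None, 90000.0, 30000.0],
})

# Step 1: fill the missing income with the column median
median_income = toy_df["annual_income"].median()
toy_df["annual_income"] = toy_df["annual_income"].fillna(median_income)

# Steps 3-4: one-hot encode the categorical column into 0/1 indicator columns
encoded_df = pd.get_dummies(toy_df, columns=["homeownership"], dtype=int)

# Step 5: standardize income to mean 0 and standard deviation 1
income = encoded_df["annual_income"]
encoded_df["annual_income_std"] = (income - income.mean()) / income.std(ddof=0)

print(encoded_df)
```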

## Feature engineering
1. definition: creating, transforming, and selecting variables so that the model is given the best predictors

## cross validation
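definition: estimate how well a machine learning model performs on unseen data by repeatedly splitting the data into training and validation folds and averaging the results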
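
The splitting part of cross validation can be sketched with NumPy alone. `k_fold_indices` is a hypothetical helper written for this note, not a library function; in practice a library splitter would normally be used instead.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k roughly equal folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)  # shuffle the row positions once
    folds = np.array_split(shuffled, k)    # cut them into k folds
    for i in range(k):
        val_idx = folds[i]                 # fold i is held out for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in splits:
    print("train:", sorted(train_idx.tolist()), "validate:", sorted(val_idx.tolist()))
```

Each row is used for validation exactly once; a model would be fit on `train_idx` and scored on `val_idx` in every round, and the scores averaged.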

# data cleaning

In [9]:
data_df = pd.read_csv('JEOPARDY_DATA.csv')
data_df.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,12/31/2004,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,12/31/2004,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,12/31/2004,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,12/31/2004,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,12/31/2004,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [13]:
data_df.columns = data_df.columns.str.replace(" ","")
data_df.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [14]:
data_df.dtypes

ShowNumber     int64
AirDate       object
Round         object
Category      object
Value         object
Question      object
Answer        object
dtype: object

In [16]:
# AirDate is currently a string in mm/dd/yyyy order; parse it into a datetime
data_df['AirDate'] = pd.to_datetime(data_df['AirDate'], format='%m/%d/%Y')
data_df.head()

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [18]:
# assign() returns a copy of the DataFrame with the new column added
data_df = data_df.assign(month = lambda x : x['AirDate'].dt.month)
data_df.head()

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,month
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,12
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,12
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,12
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,12
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,12


In [22]:
# axis=1 applies the function to each row (x is one row, indexed by column name)
data_df['month2'] = data_df.apply(lambda x : x['AirDate'].month, axis=1)
data_df.head()

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,month,month2
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,12,12
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,12,12
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,12,12
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,12,12
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,12,12
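
The two approaches above give the same column. A minimal sketch on a hypothetical two-row frame (the dates are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature frame with an already-parsed datetime column
mini_df = pd.DataFrame(
    {"AirDate": pd.to_datetime(["12/31/2004", "01/15/2005"], format="%m/%d/%Y")}
)

# Vectorized: the .dt accessor works on the whole column at once
mini_df["month"] = mini_df["AirDate"].dt.month

# Row-wise: axis=1 passes each row to the lambda as a Series
mini_df["month2"] = mini_df.apply(lambda row: row["AirDate"].month, axis=1)

print(mini_df)
```

The `.dt` version is usually the faster of the two on large frames, since `apply` with `axis=1` calls the Python function once per row.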


In [23]:
# another way to filter besides boolean indexing and np.where: Python's built-in filter()
value_list = data_df['Value'].tolist()
value_list = value_list[0:50]
value_list

['$200 ',
 '$200 ',
 '$200 ',
 '$200 ',
 '$200 ',
 '$200 ',
 '$400 ',
 '$400 ',
 '$400 ',
 '$400 ',
 '$400 ',
 '$400 ',
 '$600 ',
 '$600 ',
 '$600 ',
 '$600 ',
 '$600 ',
 '$600 ',
 '$800 ',
 '$800 ',
 '$800 ',
 '$800 ',
 '$2,000 ',
 '$800 ',
 '$1,000 ',
 '$1,000 ',
 '$1,000 ',
 '$1,000 ',
 '$1,000 ',
 '$400 ',
 '$400 ',
 '$400 ',
 '$400 ',
 '$400 ',
 '$400 ',
 '$800 ',
 '$800 ',
 '$800 ',
 '$800 ',
 '$800 ',
 '$1,200 ',
 '$2,000 ',
 '$1,200 ',
 '$1,200 ',
 '$1,200 ',
 '$1,600 ',
 '$1,600 ',
 '$1,600 ',
 '$1,600 ',
 '$1,600 ']

In [24]:
# compare on a cleaned integer version of each string, but keep the original strings in the result
filtered_list = list(filter(lambda num: int(num.replace("$","")
                                            .replace(" ","")
                                            .replace(",",""))>400, value_list))
filtered_list

['$600 ',
 '$600 ',
 '$600 ',
 '$600 ',
 '$600 ',
 '$600 ',
 '$800 ',
 '$800 ',
 '$800 ',
 '$800 ',
 '$2,000 ',
 '$800 ',
 '$1,000 ',
 '$1,000 ',
 '$1,000 ',
 '$1,000 ',
 '$1,000 ',
 '$800 ',
 '$800 ',
 '$800 ',
 '$800 ',
 '$800 ',
 '$1,200 ',
 '$2,000 ',
 '$1,200 ',
 '$1,200 ',
 '$1,200 ',
 '$1,600 ',
 '$1,600 ',
 '$1,600 ',
 '$1,600 ',
 '$1,600 ']
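
For comparison, the same filter can be written with boolean indexing in pandas. This is a sketch on a handful of sample strings in the same format as the list above; it assumes every entry is a dollar amount, so `astype(int)` would fail on any non-numeric value.

```python
import pandas as pd

# A few raw strings in the same format as the Value column
values = pd.Series(["$200 ", "$400 ", "$600 ", "$2,000 ", "$1,200 "])

# Numeric helper Series: strip "$", "," and whitespace, then convert to int
numeric = values.str.replace(r"[$,\s]", "", regex=True).astype(int)

# Boolean indexing keeps the original strings whose numeric value exceeds 400
filtered = values[numeric > 400]
print(filtered.tolist())  # ['$600 ', '$2,000 ', '$1,200 ']
```

As with the `filter()` version, the cleaned numbers are only used for the comparison, so the original strings are preserved.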