# Data Research

## Data Cleaning/Bootstrapping

### Exercise 1
Perform data cleaning on the downloaded data. If the data is already clean, add some dummy ‘bad’ rows and columns so that you can demonstrate how to clean them properly. Perform at least the following steps (plus any others you can think of):  
a. Drop Nulls  
b. Remove irrelevant columns  
c. Standardize a date/time column  
d. Standardize a string column  
e. Remove outliers  
f. Winsorize outliers using both clip and a winsorize function.

In [1]:
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

In [2]:
import pandas as pd

aaplData = pd.read_csv('aapl_stockData.csv')

aaplData['Da Boss'] = 'Tim Apple'  # Add an irrelevant column
aaplData['Ticker'] = 'AAPL'  # Add a ticker-symbol column
# Overwrite a few ticker values with a non-standard form (lower case, trailing space)
aaplData.iloc[2:10, aaplData.columns.get_loc('Ticker')] = 'aapl '
aaplData['MyDate'] = pd.to_datetime(aaplData['Date'])  # Add a datetime copy of the Date column
aaplData['MyDate'] = aaplData['MyDate'].dt.strftime('%B %d %Y')  # Store it as a non-standard string, e.g. 'October 26 2015'

aaplData

Unnamed: 0.1,Unnamed: 0,Date,High,Low,Open,Close,Volume,Adj Close,Daily Return,1-week MA Volume,1-week MA Daily Return,Da Boss,Ticker,MyDate
0,0,2015-10-26,29.532499,28.730000,29.520000,28.820000,265335200.0,26.632969,,,,Tim Apple,AAPL,October 26 2015
1,1,2015-10-27,29.135000,28.497499,28.850000,28.637501,279537600.0,26.464319,-0.006332,,,Tim Apple,AAPL,October 27 2015
2,2,2015-10-28,29.825001,29.014999,29.232500,29.817499,342205600.0,27.554773,0.041205,,,Tim Apple,aapl,October 28 2015
3,3,2015-10-29,30.172501,29.567499,29.674999,30.132500,204909200.0,27.845867,0.010564,,,Tim Apple,aapl,October 29 2015
4,4,2015-10-30,30.305000,29.862499,30.247499,29.875000,197461200.0,27.607912,-0.008546,257889760.0,,Tim Apple,aapl,October 30 2015
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1254,1254,2020-10-19,120.419998,115.660004,119.959999,115.980003,120639300.0,115.980003,-0.025542,152397020.0,-0.013857,Tim Apple,AAPL,October 19 2020
1255,1255,2020-10-20,118.980003,115.629997,116.199997,117.510002,124423700.0,117.510002,0.013192,124815660.0,-0.005914,Tim Apple,AAPL,October 20 2020
1256,1256,2020-10-21,118.709999,116.449997,116.669998,116.870003,89946000.0,116.870003,-0.005446,112592400.0,-0.007152,Tim Apple,AAPL,October 21 2020
1257,1257,2020-10-22,118.040001,114.589996,117.449997,115.750000,101709700.0,115.750000,-0.009583,110422500.0,-0.008276,Tim Apple,AAPL,October 22 2020


#### a. Drop Nulls

In [3]:
aaplData.dropna(inplace=True)
aaplData

Unnamed: 0.1,Unnamed: 0,Date,High,Low,Open,Close,Volume,Adj Close,Daily Return,1-week MA Volume,1-week MA Daily Return,Da Boss,Ticker,MyDate
5,5,2015-11-02,30.340000,29.902500,30.200001,30.295000,128813200.0,27.996040,0.014059,230585360.0,0.010190,Tim Apple,aapl,November 02 2015
6,6,2015-11-03,30.872499,30.174999,30.197500,30.642500,182076000.0,28.317165,0.011471,211093040.0,0.013750,Tim Apple,aapl,November 03 2015
7,7,2015-11-04,30.955000,30.405001,30.782499,30.500000,179544400.0,28.185480,-0.004650,178560800.0,0.004579,Tim Apple,aapl,November 04 2015
8,8,2015-11-05,30.672501,30.045000,30.462500,30.230000,158210800.0,28.055548,-0.008852,169221120.0,0.000696,Tim Apple,aapl,November 05 2015
9,9,2015-11-06,30.452499,30.155001,30.277500,30.264999,132169200.0,28.088028,0.001158,156162720.0,0.002637,Tim Apple,aapl,November 06 2015
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1254,1254,2020-10-19,120.419998,115.660004,119.959999,115.980003,120639300.0,115.980003,-0.025542,152397020.0,-0.013857,Tim Apple,AAPL,October 19 2020
1255,1255,2020-10-20,118.980003,115.629997,116.199997,117.510002,124423700.0,117.510002,0.013192,124815660.0,-0.005914,Tim Apple,AAPL,October 20 2020
1256,1256,2020-10-21,118.709999,116.449997,116.669998,116.870003,89946000.0,116.870003,-0.005446,112592400.0,-0.007152,Tim Apple,AAPL,October 21 2020
1257,1257,2020-10-22,118.040001,114.589996,117.449997,115.750000,101709700.0,115.750000,-0.009583,110422500.0,-0.008276,Tim Apple,AAPL,October 22 2020
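
Calling `dropna()` with no arguments removes every row that has a null in any column, which can discard more data than intended. As a minimal sketch, assuming a small made-up frame (`toy` and its values are hypothetical, not the AAPL data), one might count the nulls per column first and then restrict the drop to the columns that matter:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for aaplData; all values are made up.
toy = pd.DataFrame({
    "Close": [28.82, 28.64, 29.82, 30.13],
    "Daily Return": [np.nan, -0.0063, 0.0412, 0.0106],
    "Note": [None, None, "split", None],
})

# Count missing values per column before deciding what to drop.
null_counts = toy.isna().sum()
print(null_counts)

# Drop a row only when one of the columns we actually need is missing.
cleaned = toy.dropna(subset=["Close", "Daily Return"])
print(len(toy), "->", len(cleaned))
```

Here a plain `toy.dropna()` would keep only the single row that has a `Note`, whereas the `subset` version ignores that mostly empty column.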


#### b. Remove irrelevant columns

In [4]:
# Drop the dummy column and the leftover CSV index column in one call
aaplData.drop(columns=['Da Boss', 'Unnamed: 0'], inplace=True)
aaplData

Unnamed: 0,Date,High,Low,Open,Close,Volume,Adj Close,Daily Return,1-week MA Volume,1-week MA Daily Return,Ticker,MyDate
5,2015-11-02,30.340000,29.902500,30.200001,30.295000,128813200.0,27.996040,0.014059,230585360.0,0.010190,aapl,November 02 2015
6,2015-11-03,30.872499,30.174999,30.197500,30.642500,182076000.0,28.317165,0.011471,211093040.0,0.013750,aapl,November 03 2015
7,2015-11-04,30.955000,30.405001,30.782499,30.500000,179544400.0,28.185480,-0.004650,178560800.0,0.004579,aapl,November 04 2015
8,2015-11-05,30.672501,30.045000,30.462500,30.230000,158210800.0,28.055548,-0.008852,169221120.0,0.000696,aapl,November 05 2015
9,2015-11-06,30.452499,30.155001,30.277500,30.264999,132169200.0,28.088028,0.001158,156162720.0,0.002637,aapl,November 06 2015
...,...,...,...,...,...,...,...,...,...,...,...,...
1254,2020-10-19,120.419998,115.660004,119.959999,115.980003,120639300.0,115.980003,-0.025542,152397020.0,-0.013857,AAPL,October 19 2020
1255,2020-10-20,118.980003,115.629997,116.199997,117.510002,124423700.0,117.510002,0.013192,124815660.0,-0.005914,AAPL,October 20 2020
1256,2020-10-21,118.709999,116.449997,116.669998,116.870003,89946000.0,116.870003,-0.005446,112592400.0,-0.007152,AAPL,October 21 2020
1257,2020-10-22,118.040001,114.589996,117.449997,115.750000,101709700.0,115.750000,-0.009583,110422500.0,-0.008276,AAPL,October 22 2020
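
An `Unnamed: 0` column usually appears when a DataFrame index was written to CSV without a label and then read back as data. The sketch below, which uses a hypothetical in-memory CSV (`csv_text`) rather than the real file, shows two options: dropping columns with `errors="ignore"` so the cell can be re-run without failing, and reading the first column as the index so the stray column never appears.

```python
import io

import pandas as pd

# Hypothetical CSV text whose first column has an empty header, as happens
# when a DataFrame index is saved without a label.
csv_text = ",Date,Close,Da Boss\n0,2015-10-26,28.82,Tim Apple\n1,2015-10-27,28.64,Tim Apple\n"

# Option 1: read as-is, then drop unwanted columns.
# errors="ignore" skips labels that are absent, so re-running is harmless.
raw = pd.read_csv(io.StringIO(csv_text))
trimmed = raw.drop(columns=["Da Boss", "Unnamed: 0", "Not A Column"], errors="ignore")
print(list(trimmed.columns))

# Option 2: treat the first column as the index while reading.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(with_index.columns))
```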


#### c. Standardize a date/time column

In [5]:
aaplData['MyDate'] = pd.to_datetime(aaplData['MyDate'])  # Parse the 'Month DD YYYY' strings back into datetimes
aaplData['MyDate'] = aaplData['MyDate'].dt.strftime('%Y-%m-%d')  # Write them out in ISO 8601 form (YYYY-MM-DD)
aaplData

Unnamed: 0,Date,High,Low,Open,Close,Volume,Adj Close,Daily Return,1-week MA Volume,1-week MA Daily Return,Ticker,MyDate
5,2015-11-02,30.340000,29.902500,30.200001,30.295000,128813200.0,27.996040,0.014059,230585360.0,0.010190,aapl,2015-11-02
6,2015-11-03,30.872499,30.174999,30.197500,30.642500,182076000.0,28.317165,0.011471,211093040.0,0.013750,aapl,2015-11-03
7,2015-11-04,30.955000,30.405001,30.782499,30.500000,179544400.0,28.185480,-0.004650,178560800.0,0.004579,aapl,2015-11-04
8,2015-11-05,30.672501,30.045000,30.462500,30.230000,158210800.0,28.055548,-0.008852,169221120.0,0.000696,aapl,2015-11-05
9,2015-11-06,30.452499,30.155001,30.277500,30.264999,132169200.0,28.088028,0.001158,156162720.0,0.002637,aapl,2015-11-06
...,...,...,...,...,...,...,...,...,...,...,...,...
1254,2020-10-19,120.419998,115.660004,119.959999,115.980003,120639300.0,115.980003,-0.025542,152397020.0,-0.013857,AAPL,2020-10-19
1255,2020-10-20,118.980003,115.629997,116.199997,117.510002,124423700.0,117.510002,0.013192,124815660.0,-0.005914,AAPL,2020-10-20
1256,2020-10-21,118.709999,116.449997,116.669998,116.870003,89946000.0,116.870003,-0.005446,112592400.0,-0.007152,AAPL,2020-10-21
1257,2020-10-22,118.040001,114.589996,117.449997,115.750000,101709700.0,115.750000,-0.009583,110422500.0,-0.008276,AAPL,2020-10-22
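
When the incoming date strings might be malformed, it can help to state the expected layout explicitly and decide what happens to entries that do not match. The following is a minimal sketch on a made-up series (`raw` is hypothetical); `errors="coerce"` converts unparseable entries to `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical raw date strings; the last entry is deliberately invalid.
raw = pd.Series(["October 28 2015", "November 02 2015", "not a date"])

# An explicit format documents the expected layout; errors="coerce"
# turns anything that does not match into NaT.
parsed = pd.to_datetime(raw, format="%B %d %Y", errors="coerce")

# Render the valid dates as ISO 8601 strings; NaT entries stay missing.
iso = parsed.dt.strftime("%Y-%m-%d")
print(iso.tolist())
print("unparseable rows:", parsed.isna().sum())
```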


#### d. Standardize a string column

In [6]:
aaplData['Ticker'] = aaplData['Ticker'].str.strip().str.upper()  # Strip surrounding whitespace and convert to upper case
aaplData

Unnamed: 0,Date,High,Low,Open,Close,Volume,Adj Close,Daily Return,1-week MA Volume,1-week MA Daily Return,Ticker,MyDate
5,2015-11-02,30.340000,29.902500,30.200001,30.295000,128813200.0,27.996040,0.014059,230585360.0,0.010190,AAPL,2015-11-02
6,2015-11-03,30.872499,30.174999,30.197500,30.642500,182076000.0,28.317165,0.011471,211093040.0,0.013750,AAPL,2015-11-03
7,2015-11-04,30.955000,30.405001,30.782499,30.500000,179544400.0,28.185480,-0.004650,178560800.0,0.004579,AAPL,2015-11-04
8,2015-11-05,30.672501,30.045000,30.462500,30.230000,158210800.0,28.055548,-0.008852,169221120.0,0.000696,AAPL,2015-11-05
9,2015-11-06,30.452499,30.155001,30.277500,30.264999,132169200.0,28.088028,0.001158,156162720.0,0.002637,AAPL,2015-11-06
...,...,...,...,...,...,...,...,...,...,...,...,...
1254,2020-10-19,120.419998,115.660004,119.959999,115.980003,120639300.0,115.980003,-0.025542,152397020.0,-0.013857,AAPL,2020-10-19
1255,2020-10-20,118.980003,115.629997,116.199997,117.510002,124423700.0,117.510002,0.013192,124815660.0,-0.005914,AAPL,2020-10-20
1256,2020-10-21,118.709999,116.449997,116.669998,116.870003,89946000.0,116.870003,-0.005446,112592400.0,-0.007152,AAPL,2020-10-21
1257,2020-10-22,118.040001,114.589996,117.449997,115.750000,101709700.0,115.750000,-0.009583,110422500.0,-0.008276,AAPL,2020-10-22
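
The vectorized `.str` accessor is one way to standardize a string column that also tolerates missing values, whereas calling `t.upper()` on a missing entry inside a lambda would raise an `AttributeError`. A small sketch on a hypothetical series of ticker spellings:

```python
import pandas as pd

# Hypothetical ticker values with mixed case, stray whitespace and a missing entry.
tickers = pd.Series(["AAPL", "aapl ", " Aapl", None])

# .str methods skip missing values instead of raising.
standardized = tickers.str.strip().str.upper()

# Check that only one distinct spelling remains.
print(standardized.value_counts())
```

Inspecting `value_counts()` before and after the cleanup is a quick check that all variants have collapsed to one value.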


#### e. Remove outliers
We will remove rows with extreme High and Low values (those outside the 5th to 95th percentile range), on the assumption that they may be ghost prints.

In [7]:
print('High data:')
print(f'Count: {aaplData["High"].count():,}')
print(f'Average: {aaplData["High"].mean():,}')
print(f'Min: {aaplData["High"].min():,}')
print(f'Max: {aaplData["High"].max():,}')
print(f'5%: {aaplData["High"].quantile(0.05):,}')
print(f'95%: {aaplData["High"].quantile(0.95):,}')
print()

print('Low data:')
print(f'Count: {aaplData["Low"].count():,}')
print(f'Average: {aaplData["Low"].mean():,}')
print(f'Min: {aaplData["Low"].min():,}')
print(f'Max: {aaplData["Low"].max():,}')
print(f'5%: {aaplData["Low"].quantile(0.05):,}')
print(f'95%: {aaplData["Low"].quantile(0.95):,}')

High data:
Count: 1,254
Average: 48.84273129330868
Min: 22.917499542236328
Max: 137.97999572753906
5%: 24.55075006484985
95%: 98.76162414550775

Low data:
Count: 1,254
Average: 47.82803430207418
Min: 22.36750030517578
Max: 130.52999877929688
5%: 24.13187532424927
95%: 96.21212425231928


In [8]:
# Remove High outliers: keep rows whose High lies between the 5th and 95th percentiles
highMin = aaplData["High"].quantile(0.05)
highMax = aaplData["High"].quantile(0.95)
aaplDataNoOutliers = aaplData[aaplData["High"].between(highMin, highMax)]

# Remove Low outliers: the thresholds are computed on the rows that survived the High filter
lowMin = aaplDataNoOutliers["Low"].quantile(0.05)
lowMax = aaplDataNoOutliers["Low"].quantile(0.95)
aaplDataNoOutliers = aaplDataNoOutliers[aaplDataNoOutliers["Low"].between(lowMin, lowMax)]

In [9]:
print('High data (cleaned):')
print(f'Count: {aaplDataNoOutliers["High"].count():,}')
print(f'Average: {aaplDataNoOutliers["High"].mean():,}')
print(f'Min: {aaplDataNoOutliers["High"].min():,}')
print(f'Max: {aaplDataNoOutliers["High"].max():,}')
print(f'5%: {aaplDataNoOutliers["High"].quantile(0.05):,}')
print(f'95%: {aaplDataNoOutliers["High"].quantile(0.95):,}')
print()

print('Low data (cleaned):')
print(f'Count: {aaplDataNoOutliers["Low"].count():,}')
print(f'Average: {aaplDataNoOutliers["Low"].mean():,}')
print(f'Min: {aaplDataNoOutliers["Low"].min():,}')
print(f'Max: {aaplDataNoOutliers["Low"].max():,}')
print(f'5%: {aaplDataNoOutliers["Low"].quantile(0.05):,}')
print(f'95%: {aaplDataNoOutliers["Low"].quantile(0.95):,}')

High data (cleaned):
Count: 1,014
Average: 45.24749016996907
Min: 26.41250038146973
Max: 80.86000061035156
5%: 27.383249855041505
95%: 72.44462814331054

Low data (cleaned):
Count: 1,014
Average: 44.39478801599386
Min: 26.020000457763672
Max: 78.96749877929688
5%: 27.009000682830806
95%: 70.63737716674804
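
Because the Low thresholds above are computed after the High filter has already been applied, the two filters compound and the row count falls from 1,254 to 1,014. An alternative, sketched here on synthetic data (the frame `toy` and the helper `within_quantiles` are hypothetical names), is to compute both sets of thresholds on the unfiltered data and combine the resulting masks:

```python
import numpy as np
import pandas as pd

# Synthetic High/Low data; not the AAPL series.
rng = np.random.default_rng(0)
high = pd.Series(rng.normal(50, 10, 200))
toy = pd.DataFrame({"High": high, "Low": high - rng.uniform(0.5, 2.0, 200)})


def within_quantiles(series, lower=0.05, upper=0.95):
    """Return a boolean mask that is True inside [q_lower, q_upper]."""
    lo, hi = series.quantile([lower, upper])
    return series.between(lo, hi)


# Both masks use thresholds taken from the unfiltered data.
mask = within_quantiles(toy["High"]) & within_quantiles(toy["Low"])
filtered = toy[mask]
print(len(toy), "->", len(filtered))
```

Which variant is appropriate depends on whether the Low thresholds are meant to describe the full sample or only the rows that passed the High filter.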


#### f. Winsorize outliers using both clip and a winsorize function.

In [10]:
aaplData

Unnamed: 0,Date,High,Low,Open,Close,Volume,Adj Close,Daily Return,1-week MA Volume,1-week MA Daily Return,Ticker,MyDate
5,2015-11-02,30.340000,29.902500,30.200001,30.295000,128813200.0,27.996040,0.014059,230585360.0,0.010190,AAPL,2015-11-02
6,2015-11-03,30.872499,30.174999,30.197500,30.642500,182076000.0,28.317165,0.011471,211093040.0,0.013750,AAPL,2015-11-03
7,2015-11-04,30.955000,30.405001,30.782499,30.500000,179544400.0,28.185480,-0.004650,178560800.0,0.004579,AAPL,2015-11-04
8,2015-11-05,30.672501,30.045000,30.462500,30.230000,158210800.0,28.055548,-0.008852,169221120.0,0.000696,AAPL,2015-11-05
9,2015-11-06,30.452499,30.155001,30.277500,30.264999,132169200.0,28.088028,0.001158,156162720.0,0.002637,AAPL,2015-11-06
...,...,...,...,...,...,...,...,...,...,...,...,...
1254,2020-10-19,120.419998,115.660004,119.959999,115.980003,120639300.0,115.980003,-0.025542,152397020.0,-0.013857,AAPL,2020-10-19
1255,2020-10-20,118.980003,115.629997,116.199997,117.510002,124423700.0,117.510002,0.013192,124815660.0,-0.005914,AAPL,2020-10-20
1256,2020-10-21,118.709999,116.449997,116.669998,116.870003,89946000.0,116.870003,-0.005446,112592400.0,-0.007152,AAPL,2020-10-21
1257,2020-10-22,118.040001,114.589996,117.449997,115.750000,101709700.0,115.750000,-0.009583,110422500.0,-0.008276,AAPL,2020-10-22


In [11]:
# Clip the High column:
# floor anything below the 5th percentile at that value and cap anything above the 95th percentile at that value

highMin = aaplData["High"].quantile(0.05)
highMax = aaplData["High"].quantile(0.95)

aaplDataClipped = aaplData.copy()
aaplDataClipped["High"] = aaplDataClipped["High"].clip(highMin, highMax)

print(f'Count: {aaplDataClipped["High"].count():,}')
print(f'Average: {aaplDataClipped["High"].mean():,}')
print(f'Min: {aaplDataClipped["High"].min():,}')
print(f'Max: {aaplDataClipped["High"].max():,}')
print(f'5%: {aaplDataClipped["High"].quantile(0.05):,}')
print(f'95%: {aaplDataClipped["High"].quantile(0.95):,}')

Count: 1,254
Average: 47.93383540856226
Min: 24.55075006484985
Max: 98.76162414550775
5%: 24.551887373924252
95%: 98.5915684509277


In [12]:
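# Winsorize the Low column: replace the lowest 5% and the highest 5% of values
# with the nearest remaining observation
from scipy.stats.mstats import winsorize

aaplDataWinsorized = aaplData.copy()
aaplDataWinsorized["Low"] = winsorize(aaplDataWinsorized["Low"], limits=(0.05, 0.05))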

print(f'Count: {aaplDataWinsorized["Low"].count():,}')
print(f'Average: {aaplDataWinsorized["Low"].mean():,}')
print(f'Min: {aaplDataWinsorized["Low"].min():,}')
print(f'Max: {aaplDataWinsorized["Low"].max():,}')
print(f'5%: {aaplDataWinsorized["Low"].quantile(0.05):,}')
print(f'95%: {aaplDataWinsorized["Low"].quantile(0.95):,}')

Count: 1,254
Average: 47.00251188840973
Min: 24.107500076293945
Max: 96.48999786376952
5%: 24.13187532424927
95%: 96.21212425231928
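
The two approaches do not produce identical bounds. In the outputs above, the clipped `High` minimum equals the interpolated 5% quantile (24.55...), while the winsorized `Low` minimum (24.1075...) differs slightly from the `Low` 5% quantile (24.1318...). This appears to be because `clip` uses the interpolated quantile as its threshold, whereas `winsorize` replaces the tails with a value actually observed in the data. A minimal sketch on a made-up series (`values` is hypothetical) illustrates the difference:

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Hypothetical data: the numbers 1.0 through 20.0.
values = pd.Series(np.arange(1.0, 21.0))

# clip: thresholds are interpolated quantiles, which need not be observed values.
clipped = values.clip(lower=values.quantile(0.05), upper=values.quantile(0.95))

# winsorize: the lowest and highest 5% are replaced by the nearest kept observation.
winsorized = pd.Series(np.asarray(winsorize(values.to_numpy(), limits=(0.05, 0.05))))

print("clip bounds:     ", clipped.min(), clipped.max())
print("winsorize bounds:", winsorized.min(), winsorized.max())
```

With only 20 points, each 5% tail contains a single observation, so winsorizing changes just the smallest and largest values.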
