## Homework 1 - Data Exploration using Census Data

In this homework assignment, you will use census data from the 1994 and 1995 Current Population Surveys conducted by the U.S. Census Bureau. A copy of this dataset is hosted at the UCI Machine Learning Repository; see this [link](https://archive.ics.uci.edu/ml/datasets/Census-Income+%28KDD%29) for the dataset website.

This dataset contains census data extracted from the 1994 and 1995 Current Population Surveys. We will only work with the 'training' data (a link to it is provided below). The data contains 41 demographic and employment-related variables. The abbreviated column names are provided below. You are expected to read the documentation of this dataset, understand the features, and preprocess the dataset. Additional information can be found [in the data description](https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census-income.data.html) and [additional comments](https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census-income.names).

Below you will find a code snippet to read the data into a pandas dataframe. You can alternatively download it yourself, extract and read it manually. The questions are shown in the subsequent cells. You need to provide your answers in this file. 

Your code is expected to run without errors. Please make sure all your cells run properly before submitting (click Kernel->Restart & Run All to check that your code sequence works).

__Please change the notebook's name and add your name before submitting.__

In [1]:
import pandas as pd
import numpy as np

In [2]:
import urllib.request
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census-income.data.gz'
census_dl_file = './census-income.data.gz'
urllib.request.urlretrieve(url, census_dl_file) 

('./census-income.data.gz', <http.client.HTTPMessage at 0x7fe8523902e0>)

In [3]:
columns_abbr = ['AAGE', 'ACLSWKR', 'ADTIND', 'ADTOCC', 'AHGA', 'AHRSPAY', 'AHSCOL', 'AMARITL', 'AMJIND', 'AMJOCC',
                'ARACE', 'AREORGN', 'ASEX', 'AUNMEM', 'AUNTYPE', 'AWKSTAT', 'CAPGAIN', 'CAPLOSS', 'DIVVAL', 
                'FILESTAT', 'GRINREG', 'GRINST', 'HHDFMX', 'HHDREL', 'MARSUPWT', 'MIGMTR1', 'MIGMTR3', 'MIGMTR4', 
                'MIGSAME', 'MIGSUN', 'NOEMP', 'PARENT', 'PEFNTVTY', 'PEMNTVTY', 'PENATVTY', 'PRCITSHP', 'SEOTR', 
                'VETQVA', 'VETYN', 'WKSWORK', 'YEAR', 'PTOTVALB']



In [4]:
# you can read from the compressed file
df = pd.read_csv('census-income.data.gz', compression='gzip', names=columns_abbr, sep=r',', skipinitialspace=True)
#df.head()

# you can also unzip census-income.zip and read using the following
# df = pd.read_csv('census-income.csv', names=columns_abbr, sep=r',', skipinitialspace=True)

In [5]:
# you can see the DataFrame's info panel here
#df.dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199523 entries, 0 to 199522
Data columns (total 42 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   AAGE      199523 non-null  int64  
 1   ACLSWKR   199523 non-null  object 
 2   ADTIND    199523 non-null  int64  
 3   ADTOCC    199523 non-null  int64  
 4   AHGA      199523 non-null  object 
 5   AHRSPAY   199523 non-null  int64  
 6   AHSCOL    199523 non-null  object 
 7   AMARITL   199523 non-null  object 
 8   AMJIND    199523 non-null  object 
 9   AMJOCC    199523 non-null  object 
 10  ARACE     199523 non-null  object 
 11  AREORGN   198649 non-null  object 
 12  ASEX      199523 non-null  object 
 13  AUNMEM    199523 non-null  object 
 14  AUNTYPE   199523 non-null  object 
 15  AWKSTAT   199523 non-null  object 
 16  CAPGAIN   199523 non-null  int64  
 17  CAPLOSS   199523 non-null  int64  
 18  DIVVAL    199523 non-null  int64  
 19  FILESTAT  199523 non-null  object 
 20  GRIN

### Q1 - Basic Dataset Manipulation [15 pts]

#### Q1.1 Sort the instances in the dataset based on their age in descending order (AAGE attribute in the Census dataset corresponds to the age) and display top-20 instances. (5 pts)






In [6]:
# your code goes here
s = df.sort_values(by='AAGE', ascending=False)
s.head(20)

Unnamed: 0,AAGE,ACLSWKR,ADTIND,ADTOCC,AHGA,AHRSPAY,AHSCOL,AMARITL,AMJIND,AMJOCC,...,PEFNTVTY,PEMNTVTY,PENATVTY,PRCITSHP,SEOTR,VETQVA,VETYN,WKSWORK,YEAR,PTOTVALB
46356,90,Not in universe,0,0,1st 2nd 3rd or 4th grade,0,Not in universe,Widowed,Not in universe or children,Not in universe,...,Italy,Italy,United-States,Native- Born in the United States,0,No,1,0,95,- 50000.
96646,90,Not in universe,0,0,High school graduate,0,Not in universe,Married-civilian spouse present,Not in universe or children,Not in universe,...,Ecuador,Ecuador,Ecuador,Foreign born- Not a citizen of U S,0,Not in universe,2,0,95,- 50000.
80878,90,Not in universe,0,0,7th and 8th grade,0,Not in universe,Widowed,Not in universe or children,Not in universe,...,Italy,Italy,Italy,Foreign born- U S citizen by naturalization,0,Not in universe,2,0,94,- 50000.
126478,90,Not in universe,0,0,10th grade,0,Not in universe,Widowed,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,95,- 50000.
149712,90,Not in universe,0,0,Bachelors degree(BA AB BS),0,Not in universe,Never married,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,94,- 50000.
75243,90,Not in universe,0,0,7th and 8th grade,0,Not in universe,Widowed,Not in universe or children,Not in universe,...,Italy,Italy,Italy,Foreign born- U S citizen by naturalization,0,Not in universe,2,0,94,- 50000.
124107,90,Not in universe,0,0,Some college but no degree,0,Not in universe,Never married,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,95,- 50000.
143932,90,Not in universe,0,0,1st 2nd 3rd or 4th grade,0,Not in universe,Widowed,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,94,- 50000.
3360,90,Not in universe,0,0,1st 2nd 3rd or 4th grade,0,Not in universe,Widowed,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,95,- 50000.
34993,90,Not in universe,0,0,High school graduate,0,Not in universe,Widowed,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,95,- 50000.


#### Q1.2 Find the average age of the oldest 7% of instances. Display the average age to two decimal places. (5 pts)


In [7]:
# your code goes here
s = df.sort_values(by='AAGE', ascending=False)

topSeven=s.head(int(len(s)*(7/100)))

avg=topSeven["AAGE"].mean()

result = f"{avg:.2f}"  # always show two decimal places

print(result)

78.70


#### Q1.3 Find and display the median age of instances for each reported race (available in ARACE feature). [Hint: You can use groupby() function.] (5 pts)


In [8]:
# your code goes here

v = df.groupby(['ARACE'])['AAGE'].median()
print(v)

ARACE
Amer Indian Aleut or Eskimo    25.0
Asian or Pacific Islander      30.0
Black                          28.0
Other                          23.0
White                          34.0
Name: AAGE, dtype: float64


### Q2 - Identify the data scales and data types for each variable in census data. Identify the domain for each variable by checking the attributes' values. Then, create a data quality report for both categorical (nominal, ordinal) and continuous (interval, ratio) variables. [60 pts]


#### Q2.1 Identifying the characteristics (20 pts)
For data scales, identify whether an attribute is nominal, ordinal, interval, or ratio scale.
For data types, identify the domain and provide an appropriate data type (integer, float, String, date, Boolean). See if that data type is correct in your dataframe.
For domain, inspect each distinct value for each attribute and identify missing values. Use `np.unique()`.

You can create an external CSV file consisting of five (or more) columns (including feature name, description, scale, data type, and domain) and display it. A template is provided (see ___features.csv___). To understand what these features represent, please check the original documentation.

Please include this file in your submission to get points.
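
As a minimal sketch of the domain inspection described above, the snippet below applies `np.unique()` to a small made-up column (the values are hypothetical, not taken from the census file). It assumes that `?` is the placeholder for a missing value; check the dataset documentation before relying on that.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for one census attribute.
toy = pd.Series(['Private', '?', 'Private', 'Local government', '?', 'Private'])

# np.unique returns the sorted distinct values and, optionally, their counts.
domain, counts = np.unique(toy, return_counts=True)
print(dict(zip(domain, counts)))

# Assumption: '?' marks a missing value, so it is counted separately.
n_missing = int((toy == '?').sum())
print('missing:', n_missing)
```

Columns that also contain `NaN` may need `dropna()` first, because sorting mixed strings and floats can fail.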

In [9]:
# your code goes here
#information to help me organize the data
#Analyze every column 

for (names, series) in df.items():  # iteritems() was removed in pandas 2.0
  n_missing = series.isnull().sum()  # compute once and reuse below
  print('ANALYZING THE COLUMN:', names)
  print('\tTotal number of records', series.size)
  print('\tNumber of missing values', n_missing)
  print('\tPercentage of missing values {0}%'.format((n_missing/series.size)*100))
  print('\tNumber of unique values', series.unique().size)



ANALYZING THE COLUMN: AAGE
	Total number of records 199523
	Number of missing values 0
	Percentage of missing values 0.0%
	Number of unique values 91
ANALYZING THE COLUMN: ACLSWKR
	Total number of records 199523
	Number of missing values 0
	Percentage of missing values 0.0%
	Number of unique values 9
ANALYZING THE COLUMN: ADTIND
	Total number of records 199523
	Number of missing values 0
	Percentage of missing values 0.0%
	Number of unique values 52
ANALYZING THE COLUMN: ADTOCC
	Total number of records 199523
	Number of missing values 0
	Percentage of missing values 0.0%
	Number of unique values 47
ANALYZING THE COLUMN: AHGA
	Total number of records 199523
	Number of missing values 0
	Percentage of missing values 0.0%
	Number of unique values 17
ANALYZING THE COLUMN: AHRSPAY
	Total number of records 199523
	Number of missing values 0
	Percentage of missing values 0.0%
	Number of unique values 1240
ANALYZING THE COLUMN: AHSCOL
	Total number of records 199523
	Number of missing values 0


In [10]:
# please fill the information for the data characteristics in the given csv file. read it here and display.
import pandas as pd 


columns_names = ['Feature', 'Description', 'Data Type', 'Data Scale', 'Domain', 'Missing Values']

abt_path = 'features.csv'  # relative path, so the notebook also runs from the submission folder
features_df = pd.read_csv(abt_path, names = columns_names, on_bad_lines='skip')

#index_col = 'col_name'
# N, O, I, R are Nominal, Ordinal, Interval, Ratio
features_df



Unnamed: 0,Feature,Description,Data Type,Data Scale,Domain,Missing Values
0,AAGE,age,Int64,R,"{0, 1, 2, …}",0.0
1,ACLSWKR,class of worker,Object,N,"{Not in universe, Federal government, Local go...",
2,ADTIND,industry code,Int64,N,"{0, 1, 2, …}",0.0
3,ADTOCC,occupation code,Int64,N,"{0, 1, 2, …}",0.0
4,AHGA,education,Object,N,"{Children, 7th and 8th grade, 9th grade, 10th ...",0.0
5,AHRSPAY,wage per hour,Int64,R,"{0, 1, 2, …}",0.0
6,AHSCOL,enrolled in edu inst last wk,Object,N,"{Not in universe, High school, College or univ...",0.0
7,AMARITL,marital status,Object,N,"{Never married, Married-civilian spouse presen...",0.0
8,AMJIND,major industry code,Object,N,"{Not in universe or children, Entertainment, S...",0.0
9,AMJOCC,major occurpation code,Object,N,"{Not in universe, Professional specialty, Othe...",0.0


#### Q2.2 Create a Data Quality Report (40 pts)

Include the bar plots and histograms for visualizing the distributions. You may get your descriptions from features.csv file. 

Examples for a continuous and a categorical feature are shown below. You do not need to use the Jupyter formatting provided here. You can print a DataFrame or read a CSV file and display it. If you read from a CSV file, make sure it is included in your submission zip file.
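
One possible way to assemble the continuous part of such a report is sketched below on a made-up two-column frame. The column names and values are hypothetical, and the layout is only an illustration, not a required format.

```python
import pandas as pd

# Hypothetical toy data; None becomes NaN and counts as a missing value.
toy = pd.DataFrame({'age': [10, 20, 20, 40, None],
                    'hours': [0, 0, 8, 8, 8]})

# Each statistic is a Series indexed by feature name, so the
# dictionary of Series lines up into one row per feature.
report = pd.DataFrame({
    'Count': toy.count(),                            # non-missing records
    'Miss %': (toy.isnull().mean() * 100).round(2),  # share of missing records
    'Card.': toy.nunique(),                          # distinct non-missing values
    'Min': toy.min(),
    'Q1': toy.quantile(0.25),
    'Median': toy.median(),
    'Q3': toy.quantile(0.75),
    'Max': toy.max(),
    'Mean': toy.mean().round(2),
    'Std. Dev.': toy.std().round(2),
})
print(report)
```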

In [11]:
# your code goes here

import pandas as pd
df = pd.read_csv('census-income.data.gz', compression='gzip', names=columns_abbr, sep=r',', skipinitialspace=True)
features_df = pd.read_csv(abt_path, names = columns_names, on_bad_lines='skip')


# Data Quality Report for Continuous Features
continous_Headers = ['Count', 'Miss %', 'Card.', 'Min', '1st Qrt.', 'Mean', 'Median', '3rd Qrt', 'Max', 'Std. Dev.']

#Continuous features
conFeat = ['AAGE', 'AHRSPAY', 'CAPGAIN', 'CAPLOSS', 'DIVVAL','NOEMP', 'WKSWORK']
#Categorical features
catFeat = ['ACLSWKR', 'ADTIND', 'ADTOCC', 'AHGA','AHSCOL', 'AMARITL', 'AMJIND', 'AMJOCC','ARACE', 'AREORGN', 'ASEX', 'AUNMEM', 'AUNTYPE', 'AWKSTAT','FILESTAT', 'GRINREG', 'GRINST', 'HHDFMX', 'HHDREL', 'MARSUPWT', 'MIGMTR1', 'MIGMTR3', 'MIGMTR4', 
           'MIGSAME', 'MIGSUN','PARENT', 'PEFNTVTY', 'PEMNTVTY', 'PENATVTY', 'PRCITSHP', 'SEOTR', 
           'VETQVA', 'VETYN', 'YEAR', 'PTOTVALB']

dqr_continous = pd.DataFrame(index = conFeat, columns = continous_Headers)

dqr_continous.index.name = 'FEATURE NAME'

columns = df[conFeat]

#Count C
count = columns.count()
dqr_continous[continous_Headers[0]] = count

#MISS % (computed from the data instead of being hard-coded to 0)
percents = round(columns.isnull().mean() * 100, 2)

dqr_continous[continous_Headers[1]] = percents

#CARDINALITY
dqr_continous[continous_Headers[2]] = columns.nunique()

#MINIMUM
dqr_continous[continous_Headers[3]] = columns.min()

#1ST QUARTILE
dqr_continous[continous_Headers[4]] = columns.quantile(0.25)

#MEAN
dqr_continous[continous_Headers[5]] = round(columns.mean(), 2)

#MEDIAN
dqr_continous[continous_Headers[6]] = columns.median()

#3rd QUARTILE
dqr_continous[continous_Headers[7]] = columns.quantile(0.75)

#MAX
dqr_continous[continous_Headers[8]] = columns.max()

#STANDARD DEVIATION
dqr_continous[continous_Headers[9]] = round(columns.std(),2)

print(dqr_continous)
                                            

               Count  Miss %  Card.  Min  1st Qrt.    Mean  Median  3rd Qrt  \
FEATURE NAME                                                                  
AAGE          199523     0.0     91    0      15.0   34.49    33.0     50.0   
AHRSPAY       199523     0.0   1240    0       0.0   55.43     0.0      0.0   
CAPGAIN       199523     0.0    132    0       0.0  434.72     0.0      0.0   
CAPLOSS       199523     0.0    113    0       0.0   37.31     0.0      0.0   
DIVVAL        199523     0.0   1478    0       0.0  197.53     0.0      0.0   
NOEMP         199523     0.0      7    0       0.0    1.96     1.0      4.0   
WKSWORK       199523     0.0     53    0       0.0   23.17     8.0     52.0   

                Max  Std. Dev.  
FEATURE NAME                    
AAGE             90      22.31  
AHRSPAY        9999     274.90  
CAPGAIN       99999    4697.53  
CAPLOSS        4608     271.90  
DIVVAL        99999    1984.16  
NOEMP             6       2.37  
WKSWORK          52     

In [None]:
import pandas as pd

df = pd.read_csv('census-income.data.gz', compression='gzip', names=columns_abbr, sep=r',', skipinitialspace=True)
features_df = pd.read_csv(abt_path, names = columns_names, on_bad_lines='skip')

# Data Quality Report for Categorical Features
categorical_Headers = ['Count', 'Miss %', 'Card.', 'Mode', 'Mode Freq', 'Mode %', '2nd Mode', '2nd Mode Freq', '2nd Mode %']

#Categorical features
catFeat = ['ACLSWKR', 'ADTIND', 'ADTOCC', 'AHGA','AHSCOL', 'AMARITL', 'AMJIND', 'AMJOCC','ARACE', 'AREORGN', 'ASEX', 'AUNMEM', 'AUNTYPE', 'AWKSTAT','FILESTAT', 'GRINREG', 'GRINST', 'HHDFMX', 'HHDREL', 'MARSUPWT', 'MIGMTR1', 'MIGMTR3', 'MIGMTR4', 
           'MIGSAME', 'MIGSUN','PARENT', 'PEFNTVTY', 'PEMNTVTY', 'PENATVTY', 'PRCITSHP', 'SEOTR', 
           'VETQVA', 'VETYN', 'YEAR', 'PTOTVALB']


dqr_categorical = pd.DataFrame(index=catFeat, columns=categorical_Headers)

dqr_categorical.index.name = 'FEATURE NAME'

columns = df[catFeat]

#COUNT
count = columns.count()
dqr_categorical[categorical_Headers[0]] = count


#CARDINALITY
dqr_categorical[categorical_Headers[2]] = columns.nunique()


#preparing arrays for storing data
amt = len(catFeat)
missPercents = ['']*amt
modeFreqs = ['']*amt
modes = ['']*amt
modes2 = ['']*amt
modePercents = ['']*amt
modeFreqs2 = ['']*amt
modePercents2 = ['']*amt

for col in columns:
    values = columns[col].value_counts()
    index = catFeat.index(col)

    #MISS % ('?' marks a missing value; skipinitialspace=True strips its leading space)
    qMarksCount = values.get('?', 0)
    missPercents[index] = round((qMarksCount/count[col]) * 100, 2)
    if qMarksCount > 0:
        #the '?' placeholder is not a real category
        dqr_categorical.loc[col, 'Card.'] -= 1
        values = values.drop('?')

    #MODES (value_counts() sorts by frequency in descending order)
    mode = values.index[0]
    mode2 = values.index[1]
    modes[index] = mode
    modes2[index] = mode2

    #MODE FREQ
    modeCount = values.loc[mode]
    modeCount2 = values.loc[mode2]
    modeFreqs[index] = modeCount
    modeFreqs2[index] = modeCount2

    #MODE % (relative to the records that are not missing)
    valid = count[col] - qMarksCount

    modePer = (modeCount/valid)*100
    modePercents[index] = round(modePer, 2)

    modePer2 = (modeCount2/valid)*100
    modePercents2[index] = round(modePer2, 2)

dqr_categorical[categorical_Headers[1]] = missPercents
dqr_categorical[categorical_Headers[3]] = modes
dqr_categorical[categorical_Headers[4]] = modeFreqs
dqr_categorical[categorical_Headers[5]] = modePercents
dqr_categorical[categorical_Headers[6]] = modes2
dqr_categorical[categorical_Headers[7]] = modeFreqs2
dqr_categorical[categorical_Headers[8]] = modePercents2

print(dqr_categorical)

### Example Data Quality Report for Continuous Variables
| Feature | Desc. | Count | % of Missing | Card. | Min. | Q1 | Median | Q3 | Max. | Mean | Std. Dev. | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AAGE | Age | 199,523 | 0 | 91 | 0 | 15 | 33 | 50 | 90 | 34.49 | 22.31 |  |

### Example Data Quality Report for Categorical Variables
| Feature | Desc. | Count | % of Missing | Card. | Mode | Mode Freq. | Mode % | 2nd Mode | 2nd Mode Freq. | 2nd Mode Perc | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ACLSWKR | Class of worker | 199,523 | 0 | 9 | Not in Universe | 100,245 | 50.24 | Private | 72,028 |  36.10 |  |

In [None]:
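# bar plots for categorical features
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('census-income.data.gz', compression='gzip', names=columns_abbr, sep=r',', skipinitialspace=True)

# one subplot per categorical feature (catFeat is defined in the data quality report cell)
ncols = 5
nrows = -(-len(catFeat) // ncols)  # ceiling division
fig, axes = plt.subplots(nrows, ncols, figsize=(30, 6 * nrows))

for ax, col in zip(axes.ravel(), catFeat):
    # plot at most the 15 most frequent values so high-cardinality features stay readable
    df[col].value_counts().head(15).plot(kind='bar', ax=ax, title=col)
for ax in axes.ravel()[len(catFeat):]:
    ax.set_visible(False)  # hide the unused subplots

plt.tight_layout()
plt.show()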



In [None]:
# histograms for continuous features
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('census-income.data.gz', compression='gzip', names=columns_abbr, sep=r',', skipinitialspace=True)

# conFeat (defined in the data quality report cell) lists the continuous features
df_for_hist = df[conFeat]
print(df_for_hist)

fig, ax = plt.subplots(3,3,figsize=(20,20)) # get a bigger figure
axes = ax.ravel()
# DataFrame.hist needs exactly one axis per plotted column
df_for_hist.hist(bins=14, alpha=0.5, ax=axes[:len(conFeat)])
for unused in axes[len(conFeat):]:
    unused.set_visible(False) # hide the unused subplots
plt.show()

### Q3 Outlier Identification (25 pts)

#### Q3.1 For each continuous feature, identify the outliers using the IQR method. (15 pts)
For each feature, report the lower and upper bounds and number of instances that are identified as outliers.
Then, display boxplots and discuss if your outliers analysis makes sense. Discuss why you would (or would not) use the IQR method.
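
The IQR rule flags any value below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`. The sketch below works this out on two made-up series (hypothetical values, not census data). The second series is zero-inflated, which shows what happens when `Q1` and `Q3` coincide.

```python
import pandas as pd

def iqr_bounds(values):
    """Return (lower, upper) outlier bounds using the 1.5 * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def count_outliers(values):
    """Count the values that fall outside the IQR bounds."""
    lower, upper = iqr_bounds(values)
    return int(((values < lower) | (values > upper)).sum())

# Hypothetical sample with one extreme value.
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
lower, upper = iqr_bounds(sample)
print(lower, upper, count_outliers(sample))

# Hypothetical zero-inflated sample: Q1 == Q3 == 0, so the IQR is 0
# and every non-zero value falls outside the bounds.
zero_inflated = pd.Series([0] * 9 + [5])
print(iqr_bounds(zero_inflated), count_outliers(zero_inflated))
```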


In [None]:
# your code goes here
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('census-income.data.gz', compression='gzip', names=columns_abbr, sep=r',', skipinitialspace=True)
features_df = pd.read_csv(abt_path, names = columns_names, on_bad_lines='skip')

fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(40, 20))

x = 0
y = 0
x_max = 2

#IQR Column Names
iqr_cols = ['Feature', 'IQR', 'Lower Bound', 'Upper Bound', 'Outliers']
#Build Data Frame
iqr_df = pd.DataFrame(columns=iqr_cols)
for col in dqr_continous.index: #iterate through the continuous features in the data quality report
    Q3 = dqr_continous.loc[col, '3rd Qrt'] #get q1 and q3 from the continuous data quality report
    Q1 = dqr_continous.loc[col, '1st Qrt.']
    iqr = Q3-Q1 #solve for iqr
    lower = Q1-(1.5*iqr) #solve for bounds
    upper = Q3+(1.5*iqr)
    outliers = int(((df[col] < lower) | (df[col] > upper)).sum()) #count instances outside the bounds
    iqr_df.loc[len(iqr_df)] = [col, iqr, lower, upper, outliers]
    a = df.boxplot(column=col, ax=axes[y,x], grid=False, vert=False) #Build boxplot
    a.set_title(col)
    if(x == x_max): #move to the next row of subplots
        x = 0
        y += 1
    else:
        x += 1
iqr_df.set_index('Feature', inplace=True)
print(iqr_df)

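#The IQR bounds only partly make sense: a lower bound can be negative (e.g. AAGE), which is impossible for these features.
#IQR is not well suited here because several features are zero-inflated: when Q1 and Q3 are both 0 (e.g. CAPGAIN, CAPLOSS, DIVVAL), the IQR is 0 and every non-zero value is flagged as an outlier.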

#### Q3.2. Replace the outlying values in WKSWORK, MARSUPWT and AAGE features. (10 pts)
Use clamping with upper and lower bounds you found in the previous step. Report how many individual cells are being updated for each feature.
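
Clamping replaces every value below the lower bound with the lower bound and every value above the upper bound with the upper bound. A minimal sketch with `Series.clip` follows; the series and the bounds are made up for illustration.

```python
import pandas as pd

# Hypothetical values and bounds, for illustration only.
values = pd.Series([-5, 0, 3, 7, 12, 40])
lower, upper = 0, 10

# clip() leaves in-range values untouched and pulls the rest to the bounds.
clamped = values.clip(lower=lower, upper=upper)

# A cell counts as updated when its clamped value differs from the original.
n_updated = int((clamped != values).sum())
print(clamped.tolist(), n_updated)
```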


In [None]:
# your code goes here
import numpy as np
df = pd.read_csv('census-income.data.gz', compression='gzip', names=columns_abbr, sep=r',', skipinitialspace=True)
features_df = pd.read_csv(abt_path, names = columns_names, on_bad_lines='skip')

for c in ['WKSWORK', 'MARSUPWT', 'AAGE']:
    Q1 = df[c].quantile(0.25) #get q1 and q3 for the feature
    Q3 = df[c].quantile(0.75)

    iqr = Q3-Q1 #solve for iqr
    lowerBound = Q1-(1.5*iqr)
    upperBound = Q3+(1.5*iqr)
    clamped = df[c].clip(lower=lowerBound, upper=upperBound) #clamp to the bounds
    print(c, 'cells updated:', int((clamped != df[c]).sum()))
    df[c] = clamped

### Q4 Normalization (10 pts)

Normalize the MARSUPWT, AAGE, NOEMP, CAPGAIN, and CAPLOSS features
* Use range normalization for MARSUPWT feature
* Use robust scaling for NOEMP feature
* Use Z-score normalization for AAGE feature
* Use log scaling for CAPGAIN and CAPLOSS features (impute your zero values with 10 before transformation to avoid undefined $\log_{b} 0$)
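
The four transformations listed above can be sketched on one made-up series (hypothetical values, not census data). The base-10 logarithm is an assumption, since the assignment leaves the base $b$ open. The Z-score uses the population standard deviation (`ddof=0`), which is also a choice.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only.
x = pd.Series([0.0, 10.0, 20.0, 30.0, 40.0])

# Range normalization: map [min, max] onto [0, 1].
range_norm = (x - x.min()) / (x.max() - x.min())

# Robust scaling: center on the median, scale by the interquartile range.
robust = (x - x.median()) / (x.quantile(0.75) - x.quantile(0.25))

# Z-score normalization: zero mean, unit (population) standard deviation.
z_score = (x - x.mean()) / x.std(ddof=0)

# Log scaling: impute zeros with 10 first so the logarithm is defined.
log_scaled = np.log10(x.replace(0, 10))

print(range_norm.tolist())
print(robust.tolist())
print(z_score.round(3).tolist())
print(log_scaled.round(3).tolist())
```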

In [15]:
# your code goes here
from sklearn import preprocessing

df = pd.read_csv('census-income.data.gz', compression='gzip', names=columns_abbr, sep=r',', skipinitialspace=True)
features_df = pd.read_csv(abt_path, names = columns_names, on_bad_lines='skip')

#Range Normalization
print("Range Normalization")
range_normalization = (df["MARSUPWT"]-df["MARSUPWT"].min()) / (df["MARSUPWT"].max()-df["MARSUPWT"].min())
print(range_normalization)

#Robust Scaling
print("Robust Scaling")
robust_scaling = (df["NOEMP"] - df["NOEMP"].median()) / (df["NOEMP"].quantile(0.75) - df["NOEMP"].quantile(0.25))
print(robust_scaling)

#Z-Score Normalization
df['AAGE'] = (df['AAGE'] - df['AAGE'].mean())/df['AAGE'].std(ddof=0)
print(df['AAGE'])

# Log Scaling
print("Log Scaling")
# impute zeros with 10 before the transformation to avoid an undefined log(0)
capgain_log = np.log(df['CAPGAIN'].replace(0, 10))
print(capgain_log)

caploss_log = np.log(df['CAPLOSS'].replace(0, 10))
print(caploss_log)




Range Normalization
0         0.089278
1         0.054552
2         0.051244
3         0.092396
4         0.055391
            ...   
199518    0.049274
199519    0.034875
199520    0.101252
199521    0.248517
199522    0.096262
Name: MARSUPWT, Length: 199523, dtype: float64
Robust Scaling
0        -0.25
1         0.00
2        -0.25
3        -0.25
4        -0.25
          ... 
199518   -0.25
199519    0.00
199520    1.25
199521   -0.25
199522    1.25
Name: NOEMP, Length: 199523, dtype: float64
0         1.725879
1         1.053560
2        -0.739291
3        -1.142682
4        -1.097861
            ...   
199518    2.353376
199519    1.367309
199520    0.560526
199521   -0.828933
199522   -0.111793
Name: AAGE, Length: 199523, dtype: float64
Log Scaling