# Lab 1 - Data Exploration

I have provided a dataset of 25,000 undergraduate students who have previously taken a Data Science class. The goal is to predict a student's grade in the course from personal information and previous coursework. Below is an excerpt of the documentation included with the dataset:

Your task is to explore this dataset using your Python knowledge and following the data preparation steps we discussed in class.

You will find that this dataset has a number of errors - some more obvious than others. All of these are errors that you might experience while working with real-world data. While there is no need to clean the data during this lab, you should think about which strategy or strategies are most appropriate for each error.

Questions you should keep in mind while exploring the data:
* What information does each feature contain? Are there any inconsistencies within the data or the documentation?
* Is there any missing data? If so, does it appear to be MCAR, MAR, or MNAR?
* Is there any information that appears to be wrong? If so, are we able to rectify it and how?
* Are there duplicate entries? Do these entries appear to be true duplicates of the same student?
* Are there any features which should be excluded from a model?
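
As a starting point for the questions above, here is a minimal sketch of three quick checks (missing values, exact duplicate rows, and distinct values per column). The `toy` frame and its values are made up for illustration; on the lab data you would run the same calls on `dat`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the lab data; the values are invented for illustration.
toy = pd.DataFrame({
    "Name": ["Ann Lee", "Ann Lee", "Bo Chen", "Cy Diaz"],
    "Height (inches)": [64.0, 64.0, np.nan, 70.0],
    "Weight (lb)": [120, 120, 175, 160],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows that are exact copies of an earlier row.
n_duplicate_rows = int(toy.duplicated().sum())

# Number of distinct non-missing values per column, a quick consistency check.
distinct_per_column = toy.nunique()

print(missing_per_column)
print(n_duplicate_rows)
print(distinct_per_column)
```

These counts only point to where to look; deciding whether a value is wrong or a duplicate is genuine still requires reading the rows themselves.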

I encourage you to work as a group and divide up the work as much as possible, consolidating your findings into a single notebook if you can. If you get stuck on a specific question (e.g. you don't know whether missing data is MCAR, MAR, or MNAR), feel free to move on and revisit it later.

**Note:** Do not worry about finding everything in the allotted time. Data understanding can be time-consuming. There are a lot of "mistakes" in this dataset and some of them may be tricky to find. A full solution will be released after class. The purpose of this lab is to gain experience asking questions to work towards fully understanding a dataset before advancing to the modeling stage.

## Read in Dataset

In [3]:
import numpy as np
import pandas as pd

In [4]:
dat = pd.read_csv('Lab1Data.csv', index_col=0)

### First look at the dataset

In [5]:
dat.head()

Unnamed: 0,Name,Sex,Date of Birth,Age,Height (inches),Weight (lb),Class Year,Major,School,"Grade in ""Intro to python""","Grade in ""Intro to Stats""",Final Data Science Grade
0,Gabriel Hernandez,0,1997-06-16,26,68.0,160,2019,Michigan State University,Computer Science & Engineering,C,"""88.7""",81.6
1,Kyle Poole,2,2003-09-03,20,66.0,150,2025,Harvard University,Computer Science & Engineering,C,"""86.6""",91.0
2,Reginald Hanson,2,1996-05-24,27,70.0,170,2018,Marquette University,Math,Pass,"""84.1""",85.1
3,Brian Horton,0,2002-03-14,21,69.0,155,2024,Marquette University,Computer Science & Engineering,Fail,"""82.4""",86.8
4,Paige Williams,1,2002-05-24,21,63.0,135,2024,Brown University,Computer Science & Engineering,A,"""89.2""",100.0


In [6]:
dat.dtypes

Name                           object
Sex                             int64
Date of Birth                  object
Age                             int64
Height (inches)               float64
Weight (lb)                     int64
Class Year                      int64
Major                          object
School                         object
Grade in "Intro to python"     object
Grade in "Intro to Stats"      object
Final Data Science Grade      float64
dtype: object

In [7]:
dat.describe()

Unnamed: 0,Sex,Age,Height (inches),Weight (lb),Class Year,Final Data Science Grade
count,25000.0,25000.0,24246.0,25000.0,25000.0,25000.0
mean,0.98828,25.46932,70.445022,141.5156,2019.53068,84.986424
std,0.814769,3.473998,21.091354,21.539952,3.473998,6.886378
min,0.0,20.0,55.0,95.0,2014.0,56.2
25%,0.0,22.0,64.0,125.0,2017.0,80.3
50%,1.0,25.0,66.0,145.0,2020.0,85.0
75%,2.0,28.0,69.0,160.0,2023.0,89.7
max,2.0,31.0,182.0,190.0,2025.0,100.0
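
The summary above hints at two things worth checking: a maximum height of 182 is implausible in inches, and Age and Class Year share the same standard deviation (3.473998), which would follow if one were a fixed offset of the other. Below is a minimal sketch of both checks on made-up rows. The `toy` frame and the 96-inch cutoff are assumptions for illustration, not part of the lab data or its documentation.

```python
import pandas as pd

# Toy rows shaped like the lab data; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [26, 20, 27, 21],
    "Class Year": [2019, 2025, 2018, 2024],
    "Height (inches)": [68.0, 66.0, 178.0, 63.0],
})

# Hypothetical plausibility cutoff: 96 inches is 8 feet.
MAX_PLAUSIBLE_HEIGHT_IN = 96
suspect_heights = toy[toy["Height (inches)"] > MAX_PLAUSIBLE_HEIGHT_IN]

# If Age + Class Year takes a single value, one column determines the other.
n_distinct_sums = (toy["Age"] + toy["Class Year"]).nunique()

print(suspect_heights)
print(n_distinct_sums)
```

A single distinct sum would suggest the two columns carry the same information, which is relevant to the question of which features to exclude from a model.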


## Your Work Here...

In [19]:
# Swap the "Major" and "School" labels: the two columns' contents are switched.
# Note: re-running this cell swaps the labels back.
dat = dat.rename(columns={"Major": "School", "School": "Major"})

# Convert the Stats grade to float: strip the literal quote characters first.
stats_col = 'Grade in "Intro to Stats"'
dat[stats_col] = dat[stats_col].str.replace('"', '', regex=False).astype(float)
dat.head()

# Why is Sex coded as 0, 1, 2?

Unnamed: 0,Name,Sex,Date of Birth,Age,Height (inches),Weight (lb),Class Year,Major,School,"Grade in ""Intro to python""","Grade in ""Intro to Stats""",Final Data Science Grade
0,Gabriel Hernandez,0,1997-06-16,26,68.0,160,2019,Michigan State University,Computer Science & Engineering,C,88.7,81.6
1,Kyle Poole,2,2003-09-03,20,66.0,150,2025,Harvard University,Computer Science & Engineering,C,86.6,91.0
2,Reginald Hanson,2,1996-05-24,27,70.0,170,2018,Marquette University,Math,Pass,84.1,85.1
3,Brian Horton,0,2002-03-14,21,69.0,155,2024,Marquette University,Computer Science & Engineering,Fail,82.4,86.8
4,Paige Williams,1,2002-05-24,21,63.0,135,2024,Brown University,Computer Science & Engineering,A,89.2,100.0
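
One way to start on the question about the Sex coding is to count how often each code occurs. This is a minimal sketch on made-up values. The `toy` frame is an assumption for illustration, and what each code means still has to come from the dataset documentation.

```python
import pandas as pd

# Toy column with the same three codes seen in the lab data (values invented).
toy = pd.DataFrame({"Sex": [0, 2, 2, 0, 1]})

# Frequency of each code, ordered by code rather than by count.
sex_counts = toy["Sex"].value_counts().sort_index()

print(sex_counts)
```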


In [42]:
# Keep every row that has at least one missing value
rows_with_nan = dat[dat.isnull().any(axis=1)]

# Record which columns contain missing values (column name + the missing marker)
values_missing = set()
for col in dat.columns:
    for val in dat[col]:
        if pd.isnull(val):
            values_missing.add(str(col) + str(val))

print(values_missing)

{'Grade in "Intro to python"nan', 'Height (inches)nan'}
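
The set above shows which columns have missing values but not how many. A shorter way to get both is `isna().sum()`, sketched here on made-up rows. The `toy` frame is an assumption for illustration; on the lab data the same two lines would be applied to `dat`.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like the lab data; the values are invented for illustration.
toy = pd.DataFrame({
    'Grade in "Intro to python"': ["A", np.nan, "Pass", np.nan],
    "Height (inches)": [68.0, 66.0, np.nan, 63.0],
    "Weight (lb)": [160, 150, 170, 135],
})

# Count missing values per column, then keep only columns that have any.
missing_counts = toy.isna().sum()
columns_with_missing = missing_counts[missing_counts > 0]

print(columns_with_missing)
```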


In [43]:
dat.loc[dat['Grade in "Intro to python"'].isna(),:].describe()



Unnamed: 0,Sex,Age,Height (inches),Weight (lb),Class Year,"Grade in ""Intro to Stats""",Final Data Science Grade
count,6292.0,6292.0,6115.0,6292.0,6292.0,6292.0,6292.0
mean,0.98061,25.454069,70.042191,141.616338,2019.545931,85.075858,84.926335
std,0.815682,3.478154,20.098332,21.520681,3.478154,6.967293,6.802853
min,0.0,20.0,56.0,95.0,2014.0,60.9,61.1
25%,0.0,22.0,64.0,125.0,2017.0,80.3,80.3
50%,1.0,26.0,66.0,145.0,2019.0,85.1,84.8
75%,2.0,28.0,69.0,160.0,2023.0,89.9,89.7
max,2.0,31.0,182.0,190.0,2025.0,100.0,100.0


In [44]:
dat.loc[dat['Height (inches)'].isna(),:].describe()

Unnamed: 0,Sex,Age,Height (inches),Weight (lb),Class Year,"Grade in ""Intro to Stats""",Final Data Science Grade
count,754.0,754.0,0.0,754.0,754.0,754.0,754.0
mean,0.710875,25.31565,,170.251989,2019.68435,84.996684,84.96313
std,0.957927,3.554335,,10.262305,3.554335,7.040984,7.128328
min,0.0,20.0,,155.0,2014.0,60.4,65.1
25%,0.0,22.0,,160.0,2017.0,80.7,80.2
50%,0.0,25.0,,170.0,2020.0,84.85,85.1
75%,2.0,28.0,,180.0,2023.0,90.1,90.0
max,2.0,31.0,,190.0,2025.0,100.0,100.0
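
In the output above, rows with missing height have a minimum weight of 155 and a mean of about 170, compared with a minimum of 95 and a mean of about 141.5 in the full dataset. That hints that the missingness may depend on weight, which would argue against MCAR. One way to make the comparison explicit is to group on a missingness indicator, sketched here on made-up rows (the `toy` frame and the `height_missing` name are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Toy rows shaped like the lab data; the values are invented for illustration.
toy = pd.DataFrame({
    "Height (inches)": [68.0, np.nan, 63.0, np.nan, 66.0],
    "Weight (lb)": [150, 170, 125, 180, 140],
})

# Boolean indicator: True where height is missing.
height_missing = toy["Height (inches)"].isna().rename("height_missing")

# Summarize weight separately for rows with and without a recorded height.
weight_by_missingness = (
    toy.groupby(height_missing)["Weight (lb)"].agg(["count", "min", "mean"])
)

print(weight_by_missingness)
```

A clear difference between the two groups is evidence against MCAR, but it cannot by itself distinguish MAR from MNAR.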


In [45]:
dat[dat.duplicated(keep=False)]


Unnamed: 0,Name,Sex,Date of Birth,Age,Height (inches),Weight (lb),Class Year,Major,School,"Grade in ""Intro to python""","Grade in ""Intro to Stats""",Final Data Science Grade
11,Matthew Kelly,2,1993-05-20,30,69.0,150,2015,MIT,Computer Science & Engineering,C,76.0,98.0
42,Matthew Kelly,2,1993-05-20,30,69.0,150,2015,MIT,Computer Science & Engineering,C,76.0,98.0
69,Matthew Kelly,2,1993-05-20,30,69.0,150,2015,MIT,Computer Science & Engineering,C,76.0,98.0
125,Matthew Kelly,2,1993-05-20,30,69.0,150,2015,MIT,Computer Science & Engineering,C,76.0,98.0
4865,Matthew Kelly,2,1993-05-20,30,69.0,150,2015,MIT,Computer Science & Engineering,C,76.0,98.0
12345,Matthew Kelly,2,1993-05-20,30,69.0,150,2015,MIT,Computer Science & Engineering,C,76.0,98.0
12597,Matthew Kelly,2,1993-05-20,30,69.0,150,2015,MIT,Computer Science & Engineering,C,76.0,98.0
23578,Matthew Kelly,2,1993-05-20,30,69.0,150,2015,MIT,Computer Science & Engineering,C,76.0,98.0
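
To see how many copies each duplicated record has, the duplicated rows can be grouped on all columns. This is a minimal sketch on made-up rows. The `toy` frame is an assumption for illustration, and the `drop_duplicates` step only shows what removal would look like if the copies are judged to be true duplicates, since the lab itself does not require cleaning.

```python
import pandas as pd

# Toy rows shaped like the lab data; the values are invented for illustration.
toy = pd.DataFrame({
    "Name": ["Matthew Kelly", "Paige Williams", "Matthew Kelly", "Matthew Kelly"],
    "Date of Birth": ["1993-05-20", "2002-05-24", "1993-05-20", "1993-05-20"],
    "Final Data Science Grade": [98.0, 100.0, 98.0, 98.0],
})

# Mark every row that has at least one exact copy elsewhere in the frame.
duplicate_mask = toy.duplicated(keep=False)

# Count how many times each duplicated record appears.
copies_per_record = toy[duplicate_mask].groupby(list(toy.columns)).size()

# Keep only the first occurrence of each record.
deduplicated = toy.drop_duplicates(keep="first")

print(copies_per_record)
print(len(deduplicated))
```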
