# Lilit Beglaryan

## About the data

The main goal is to analyze the relationship between respondents' characteristics and their GPA. The variables are described below:

* studentid -- ID of Respondent
* surveydate -- Date the survey was conducted
	
* age -- Age of Respondent
* ehpw -- Hours spent on extracurricular activities a week
* hpw -- Hours spent on studying a week
* hsleep -- Hours of sleep per day
* gpa -- Grade point average of the student [0-100]
	
* gender -- 0:male, 1:female
* job -- 1: Respondent has a job, 0: Respondent does not have a job
* type -- 0: part-time, 1: full-time
	
* marital.status
    1. single -- Respondent is single and has never been married
    2. married -- Respondent is married
    3. divorced -- Respondent is divorced or widowed
	

* imp -- Importance of getting/maintaining a high GPA (85 or greater)
	1: Not Important to 5: Very Important


In [3]:
import numpy as np 
import pandas as pd
print(np.__version__)

1.24.1


## Part 1: Introduction to data

In [28]:
main_data = pd.read_csv("gpafactors.csv")

In [29]:
main_data.head()

Unnamed: 0.1,Unnamed: 0,studentid,surveydate,age,ehpw,hpw,hsleep,gpa,imp,gender,job,type,marital.status
0,1,57327,4/1/2018,,24.0,16.0,6.29,46.35,5,male,empl,par-time,single
1,2,231,1/1/2018,18.0,13.0,9.0,8.86,36.84,1,female,empl,par-time,single
2,3,10474,1/17/2018,26.0,20.0,19.0,6.43,65.07,5,male,unempl,full-time,divorced
3,4,8654,1/14/2018,20.0,19.0,11.0,7.71,33.87,2,female,empl,par-time,single
4,5,80185,5/7/2018,27.0,19.0,21.0,6.29,65.52,2,male,unempl,full-time,divorced


In [30]:
# Removing columns that don't have any description
del main_data["Unnamed: 0"]

In [7]:
main_data.head()

Unnamed: 0,studentid,surveydate,age,ehpw,hpw,hsleep,gpa,imp,gender,job,type,marital.status
0,57327,4/1/2018,,24.0,16.0,6.29,46.35,5,male,empl,par-time,single
1,231,1/1/2018,18.0,13.0,9.0,8.86,36.84,1,female,empl,par-time,single
2,10474,1/17/2018,26.0,20.0,19.0,6.43,65.07,5,male,unempl,full-time,divorced
3,8654,1/14/2018,20.0,19.0,11.0,7.71,33.87,2,female,empl,par-time,single
4,80185,5/7/2018,27.0,19.0,21.0,6.29,65.52,2,male,unempl,full-time,divorced


In [31]:
# Extract info about each column to see whether any column has missing values
main_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   studentid       2000 non-null   int64  
 1   surveydate      2000 non-null   object 
 2   age             1998 non-null   float64
 3   ehpw            1999 non-null   float64
 4   hpw             1998 non-null   float64
 5   hsleep          1998 non-null   float64
 6   gpa             2000 non-null   float64
 7   imp             2000 non-null   int64  
 8   gender          1999 non-null   object 
 9   job             1999 non-null   object 
 10  type            2000 non-null   object 
 11  marital.status  1999 non-null   object 
dtypes: float64(5), int64(2), object(5)
memory usage: 187.6+ KB


As the info summary shows, age, ehpw, hpw, hsleep, gender, job and marital.status each have fewer than 2000 non-null entries, so these columns contain missing values.
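
Reading non-null counts off `info()` is easy to get wrong, so a more direct check can help. The sketch below counts missing values per column with `isna().sum()`. The `toy` frame and its values are made up for illustration and only stand in for `main_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for main_data (illustrative values, not taken from gpafactors.csv)
toy = pd.DataFrame({
    "age": [18.0, np.nan, 26.0],
    "hpw": [9.0, 16.0, np.nan],
    "gender": ["female", None, "male"],
    "gpa": [36.84, 46.35, 65.07],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()

# Keep only the names of columns that actually have gaps
columns_with_gaps = missing_counts[missing_counts > 0].index.tolist()
print(columns_with_gaps)
```

Applied to the real data, the same two lines on `main_data` would list every column that needs attention.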

## Handling Missing values

* Drop rows/columns that contain missing values
* Fill in missing values
    1. For numeric variables, the most common technique is to fill with the mean or median
    2. For categorical variables, fill with the most frequent value (the mode)
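
The two strategies listed above can be sketched on a small example. This is a minimal illustration on a made-up `toy` frame (an assumption, not the survey data). It uses the median for the numeric column, whereas the cells that follow use the mean.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one categorical column (illustrative values)
toy = pd.DataFrame({
    "hsleep": [6.0, np.nan, 8.0, 7.0],
    "job": ["empl", "unempl", None, "empl"],
})

# Strategy 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Strategy 2: fill the numeric column with its median
# and the categorical column with its most frequent value (mode)
filled = toy.copy()
filled["hsleep"] = filled["hsleep"].fillna(filled["hsleep"].median())
filled["job"] = filled["job"].fillna(filled["job"].mode().iloc[0])

print(len(dropped))
print(filled)
```

Dropping loses whole rows, while filling keeps every row at the cost of inventing values, so the choice depends on how many rows are affected.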

In [32]:
main_data.dropna(subset = ["age"], inplace = True)

In [33]:
main_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1998 entries, 1 to 1999
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   studentid       1998 non-null   int64  
 1   surveydate      1998 non-null   object 
 2   age             1998 non-null   float64
 3   ehpw            1997 non-null   float64
 4   hpw             1996 non-null   float64
 5   hsleep          1996 non-null   float64
 6   gpa             1998 non-null   float64
 7   imp             1998 non-null   int64  
 8   gender          1997 non-null   object 
 9   job             1997 non-null   object 
 10  type            1998 non-null   object 
 11  marital.status  1997 non-null   object 
dtypes: float64(5), int64(2), object(5)
memory usage: 202.9+ KB


In [21]:
main_data.fillna(main_data.mean(numeric_only=True).round(1), inplace = True)
# main_data.select_dtypes(np.number).info()

In [25]:
main_data.info()
main_data.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1998 entries, 1 to 1999
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      1998 non-null   int64  
 1   studentid       1998 non-null   int64  
 2   surveydate      1998 non-null   object 
 3   age             1998 non-null   float64
 4   ehpw            1998 non-null   float64
 5   hpw             1998 non-null   float64
 6   hsleep          1998 non-null   float64
 7   gpa             1998 non-null   float64
 8   imp             1998 non-null   int64  
 9   gender          1998 non-null   object 
 10  job             1998 non-null   object 
 11  type            1998 non-null   object 
 12  marital.status  1998 non-null   object 
dtypes: float64(5), int64(3), object(5)
memory usage: 218.5+ KB


Unnamed: 0.1,Unnamed: 0,studentid,surveydate,age,ehpw,hpw,hsleep,gpa,imp,gender,job,type,marital.status
1,2,231,1/1/2018,18.0,13.0,9.0,8.86,36.84,1,female,empl,par-time,single
2,3,10474,1/17/2018,26.0,20.0,19.0,6.43,65.07,5,male,unempl,full-time,divorced
3,4,8654,1/14/2018,20.0,19.0,11.0,7.71,33.87,2,female,empl,par-time,single
4,5,80185,5/7/2018,27.0,19.0,21.0,6.29,65.52,2,male,unempl,full-time,divorced
5,6,69894,4/21/2018,21.0,14.0,13.0,8.14,48.61,3,female,empl,par-time,single


In [34]:
main_data.fillna(main_data.mode().iloc[0], inplace = True)

In [36]:
main_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1998 entries, 1 to 1999
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   studentid       1998 non-null   int64  
 1   surveydate      1998 non-null   object 
 2   age             1998 non-null   float64
 3   ehpw            1998 non-null   float64
 4   hpw             1998 non-null   float64
 5   hsleep          1998 non-null   float64
 6   gpa             1998 non-null   float64
 7   imp             1998 non-null   int64  
 8   gender          1998 non-null   object 
 9   job             1998 non-null   object 
 10  type            1998 non-null   object 
 11  marital.status  1998 non-null   object 
dtypes: float64(5), int64(2), object(5)
memory usage: 202.9+ KB


## Checking values of data

In [37]:
main_data.describe()

Unnamed: 0,studentid,age,ehpw,hpw,hsleep,gpa,imp
count,1998.0,1998.0,1998.0,1998.0,1998.0,1998.0,1998.0
mean,49943.434935,23.383383,20.039039,14.943443,7.004735,49.27485,3.008509
std,28908.774357,7.230808,3.042427,4.899215,0.827695,15.8198,1.423189
min,74.0,16.0,10.0,1.0,3.71,2.6,1.0
25%,25297.5,21.0,18.0,12.0,6.43,38.46,2.0
50%,49350.0,23.0,20.0,15.0,7.0,49.01,3.0
75%,74586.75,26.0,22.0,18.0,7.57,59.65,4.0
max,99986.0,304.0,29.0,33.0,9.43,99.5,5.0


**Do you see any problems here?**

In [38]:
main_data = main_data[main_data['age']!= 304]
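
Filtering on the exact value 304 only removes that one outlier. A range check is a possible alternative that would also catch other implausible ages. The sketch below uses a made-up `toy` frame, and the bounds `MIN_AGE` and `MAX_AGE` are assumptions chosen for illustration, not values from the survey documentation.

```python
import pandas as pd

# Toy frame containing one implausible age (illustrative values)
toy = pd.DataFrame({
    "age": [18.0, 304.0, 26.0, 35.0],
    "gpa": [36.84, 50.0, 65.07, 70.1],
})

# Assumed plausibility bounds for a student's age
MIN_AGE, MAX_AGE = 15, 100

# between() is inclusive on both ends by default
plausible = toy["age"].between(MIN_AGE, MAX_AGE)
cleaned = toy[plausible]

print(cleaned)
```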

In [48]:
main_data.describe()

Unnamed: 0,studentid,age,ehpw,hpw,hsleep,gpa,imp
count,1996.0,1996.0,1996.0,1996.0,1996.0,1996.0,1996.0
mean,49945.719439,23.245992,20.039078,14.941884,7.004955,49.269449,3.008517
std,28915.542321,3.581365,3.043787,4.901192,0.828005,15.823854,1.422493
min,74.0,16.0,10.0,1.0,3.71,2.6,1.0
25%,25280.25,21.0,18.0,12.0,6.43,38.45,2.0
50%,49350.0,23.0,20.0,15.0,7.0,49.01,3.0
75%,74595.5,26.0,22.0,18.0,7.57,59.63,4.0
max,99986.0,35.0,29.0,33.0,9.43,99.5,5.0


In [49]:
main_data.describe(include = 'object')

Unnamed: 0,gender,job,type,marital.status
count,1996,1996,1996,1996
unique,2,2,2,3
top,male,unempl,par-time,single
freq,1051,1028,1082,1376


**Do you see any problems here?**

In [40]:
np.unique(main_data['gender'])

array(['.', 'female', 'male'], dtype=object)

In [41]:
main_data = main_data[main_data['gender']!= "."]
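
The same kind of check can be written against a set of expected labels, so that any stray value is caught, not only `"."`. This is a sketch on a made-up `toy` frame. The `allowed` set is an assumption based on the labels seen in the output above.

```python
import pandas as pd

# Toy column with one placeholder value (illustrative values)
toy = pd.DataFrame({"gender": ["male", "female", ".", "male"]})

# Assumed set of valid labels
allowed = {"male", "female"}

# Frequency of every distinct value, including unexpected ones
print(toy["gender"].value_counts())

# Flag rows whose label is in the allowed set, then keep only those
is_valid = toy["gender"].isin(allowed)
unexpected = sorted(toy.loc[~is_valid, "gender"].unique())
cleaned = toy[is_valid]

print(unexpected)
```

`value_counts()` also shows how often each label occurs, which `np.unique` does not report by default.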

In [42]:
main_data.describe(include = 'object')

Unnamed: 0,surveydate,gender,job,type,marital.status
count,1996,1996,1996,1996,1996
unique,158,2,2,2,3
top,3/11/2018,male,unempl,par-time,single
freq,24,1051,1028,1082,1376


**Anything else?**

In [47]:
main_data["surveydate"] = pd.to_datetime(main_data["surveydate"])

## Resetting Index

In [52]:
main_data.head()
main_data.describe()

Unnamed: 0,studentid,age,ehpw,hpw,hsleep,gpa,imp
count,1996.0,1996.0,1996.0,1996.0,1996.0,1996.0,1996.0
mean,49945.719439,23.245992,20.039078,14.941884,7.004955,49.269449,3.008517
std,28915.542321,3.581365,3.043787,4.901192,0.828005,15.823854,1.422493
min,74.0,16.0,10.0,1.0,3.71,2.6,1.0
25%,25280.25,21.0,18.0,12.0,6.43,38.45,2.0
50%,49350.0,23.0,20.0,15.0,7.0,49.01,3.0
75%,74595.5,26.0,22.0,18.0,7.57,59.63,4.0
max,99986.0,35.0,29.0,33.0,9.43,99.5,5.0


In [56]:
main_data.reset_index(drop = True, inplace = True)
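
To see why the reset is needed, the following sketch shows how filtering leaves gaps in the index and how `reset_index(drop=True)` renumbers the rows. The `toy` frame is made up for illustration.

```python
import pandas as pd

# Toy frame; filtering out the middle row leaves a gap in the index
toy = pd.DataFrame({"age": [18.0, 304.0, 26.0]})
filtered = toy[toy["age"] != 304]
print(filtered.index.tolist())

# drop=True discards the old index instead of keeping it as a new column
renumbered = filtered.reset_index(drop=True)
print(renumbered.index.tolist())
```

Without `drop=True`, the old labels would be kept in an extra column named `index`.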

In [58]:
main_data.head()
# main_data.describe()

Unnamed: 0,studentid,surveydate,age,ehpw,hpw,hsleep,gpa,imp,gender,job,type,marital.status
0,231,2018-01-01,18.0,13.0,9.0,8.86,36.84,1,female,empl,par-time,single
1,10474,2018-01-17,26.0,20.0,19.0,6.43,65.07,5,male,unempl,full-time,divorced
2,8654,2018-01-14,20.0,19.0,11.0,7.71,33.87,2,female,empl,par-time,single
3,80185,2018-05-07,27.0,19.0,21.0,6.29,65.52,2,male,unempl,full-time,divorced
4,69894,2018-04-21,21.0,14.0,13.0,8.14,48.61,3,female,empl,par-time,single
