# COGS 108 - Data Checkpoint

# Names

- Ben Liou
- Fanqi Lin
- Tasnia Jamal
- Kendrick Nguyen

<a id='research_question'></a>
# Research Question

*Is there a correlation between students' Big Five personality traits (i.e. openness, conscientiousness, extraversion, agreeableness, and neuroticism) and academic performance?*

# Dataset(s)

## *Dataset 1*

- Dataset Name: Engineering Graduate Salary Prediction
- Link to the dataset: https://www.kaggle.com/datasets/manishkc06/engineering-graduate-salary-prediction
- Number of observations: 2998 students
- Description: Originally compiled to predict the salaries of engineering graduates, the dataset includes columns such as college GPA and AMCAT personality test scores. We use college GPA to measure academic performance and the AMCAT scores to measure the Big Five personality traits.

## *Dataset 2*

- Dataset Name: Correlation of Personality Traits & GPA for Jordanian Medical Students
- Link to the dataset: https://data.mendeley.com/datasets/5rwpwr9rf2/1
- Number of observations: 307 students
- Description: A complete dataset on medical students at Hashemite University, Jordan, recording their Big Five personality traits and college GPA. We plan to analyze it alongside the first dataset because it provides the same variables.


# Setup

Import the following packages under their conventional aliases (shown in parentheses):
- `numpy (np)`
- `pandas (pd)`

In [18]:
import numpy as np
import pandas as pd

Set `EXPORT_CSV` to `True` to export all cleaned datasets; otherwise set it to `False`.

In [19]:
EXPORT_CSV = True

Read the first dataset `Engineering_graduate_salary.csv` and assign it to the variable `engineering_df`

In [20]:
engineering_df = pd.read_csv('data/raw_data/Engineering_graduate_salary.csv')

engineering_df.head()

Unnamed: 0,ID,Gender,DOB,10percentage,10board,12graduation,12percentage,12board,CollegeID,CollegeTier,...,MechanicalEngg,ElectricalEngg,TelecomEngg,CivilEngg,conscientiousness,agreeableness,extraversion,nueroticism,openess_to_experience,Salary
0,604399,f,1990-10-22,87.8,cbse,2009,84.0,cbse,6920,1,...,-1,-1,-1,-1,-0.159,0.3789,1.2396,0.1459,0.2889,445000
1,988334,m,1990-05-15,57.0,cbse,2010,64.5,cbse,6624,2,...,-1,-1,-1,-1,1.1336,0.0459,1.2396,0.5262,-0.2859,110000
2,301647,m,1989-08-21,77.33,"maharashtra state board,pune",2007,85.17,amravati divisional board,9084,2,...,-1,-1,260,-1,0.51,-0.1232,1.5428,-0.2902,-0.2875,255000
3,582313,m,1991-05-04,84.3,cbse,2009,86.0,cbse,8195,1,...,-1,-1,-1,-1,-0.4463,0.2124,0.3174,0.2727,0.4805,420000
4,339001,f,1990-10-30,82.0,cbse,2008,75.0,cbse,4889,2,...,-1,-1,-1,-1,-1.4992,-0.7473,-1.0697,0.06223,0.1864,200000


Read the second dataset `Data for Repository Personality and GPA Research.csv` and assign it to the variable `medical_df`

In [21]:
medical_df = pd.read_csv('data/raw_data/Data for Repository Personality and GPA Research.csv')

medical_df.head()

Unnamed: 0,No.,Corresponding Year,Gender,Extraversion,Agreeableness,Conscientiousness,Neuroticism,Openness,GPA,Unnamed: 9,Key: Gender 1 = Male; Gender 2 = Female,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15
0,1,2,2,0.725,0.644,0.556,0.95,0.84,3.84,,,,,,,
1,2,2,2,0.4,0.867,0.689,0.625,0.52,2.9,,,,,,,
2,3,2,2,0.7,0.756,0.711,0.6,0.76,2.57,,,,,,,
3,4,2,2,0.575,0.556,0.644,0.55,0.54,2.58,,,,,,,
4,5,2,2,0.575,0.911,0.8,0.625,0.78,3.65,,,,,,,
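The `Unnamed: N` columns in the preview above appear to be empty spreadsheet columns, and the gender key is stored as a column header. The notebook handles this later by selecting columns by name. As an alternative, here is a minimal sketch of dropping all-empty columns with `dropna(axis=1, how='all')`. The frame `toy_df` is a hypothetical stand-in, not the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw file: two real columns plus two empty ones
toy_df = pd.DataFrame({
    'GPA': [3.84, 2.90],
    'Openness': [0.84, 0.52],
    'Unnamed: 9': [np.nan, np.nan],
    'Key: Gender 1 = Male; Gender 2 = Female': [np.nan, np.nan],
})

# Drop every column whose values are all NaN
trimmed_df = toy_df.dropna(axis=1, how='all')
print(list(trimmed_df.columns))
```

Selecting columns by name, as done in Part III, is more explicit. The sketch only shows how the empty columns could be detected automatically.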


# Data Cleaning

## Part I: Cleaning the Engineering Graduate Dataset, `engineering_df`

According to Kaggle's summary, the dataset has 34 columns. However, our only columns of interest are `collegeGPA` and the AMCAT personality test scores; we also keep `Gender`. We can therefore slice the dataframe down to these columns.

In [22]:
# Output raw dataset columns
engineering_df.columns

Index(['ID', 'Gender', 'DOB', '10percentage', '10board', '12graduation',
       '12percentage', '12board', 'CollegeID', 'CollegeTier', 'Degree',
       'Specialization', 'collegeGPA', 'CollegeCityID', 'CollegeCityTier',
       'CollegeState', 'GraduationYear', 'English', 'Logical', 'Quant',
       'Domain', 'ComputerProgramming', 'ElectronicsAndSemicon',
       'ComputerScience', 'MechanicalEngg', 'ElectricalEngg', 'TelecomEngg',
       'CivilEngg', 'conscientiousness', 'agreeableness', 'extraversion',
       'nueroticism', 'openess_to_experience', 'Salary'],
      dtype='object')

In [23]:
# Save new dataframe with Gender, collegeGPA, and the AMCAT Personality test score columns
engineering_df = engineering_df[['Gender',
                                 'collegeGPA', 
                                 'conscientiousness', 
                                 'agreeableness', 
                                 'extraversion',
                                 'nueroticism', 
                                 'openess_to_experience']]

# Rename columns
engineering_df = engineering_df.set_axis(['gender',
                                          'gpa', 
                                          'conscientiousness', 
                                          'agreeableness', 
                                          'extraversion', 
                                          'neuroticism', 
                                          'openness'], axis=1)

engineering_df.head()

Unnamed: 0,gender,gpa,conscientiousness,agreeableness,extraversion,neuroticism,openness
0,f,73.82,-0.159,0.3789,1.2396,0.1459,0.2889
1,m,65.0,1.1336,0.0459,1.2396,0.5262,-0.2859
2,m,61.94,0.51,-0.1232,1.5428,-0.2902,-0.2875
3,m,80.4,-0.4463,0.2124,0.3174,0.2727,0.4805
4,f,64.3,-1.4992,-0.7473,-1.0697,0.06223,0.1864


Drop rows where any column has missing data (`NaN`).

In [24]:
raw_size = engineering_df.shape
engineering_df = engineering_df.dropna().reset_index(drop=True)

print(f'Original size: {raw_size}')
print(f'Filtered size: {engineering_df.shape}')

Original size: (2998, 7)
Filtered size: (2998, 7)
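The matching sizes above indicate that no rows contained `NaN`. To see where missing values sit before dropping rows, a per-column count can help. The sketch below uses a small hypothetical frame (`toy_df`), not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing GPA and one missing openness score
toy_df = pd.DataFrame({
    'gender': ['f', 'm', 'm'],
    'gpa': [2.95, np.nan, 2.48],
    'openness': [0.29, -0.29, np.nan],
})

# Count missing values in each column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Drop rows with any missing value, then report how many were removed
cleaned_df = toy_df.dropna().reset_index(drop=True)
print(f'Rows dropped: {len(toy_df) - len(cleaned_df)}')
```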


Notice that GPA in this dataset is on a scale of up to 100.0. These values are most likely percentages, so we linearly map them onto the 4.0 scale we are accustomed to.

In [25]:
# Linearly rescale GPA from a 0-100 percentage to a 0-4.0 scale
engineering_df['gpa'] = engineering_df['gpa'] * 4.0 / 100.0

engineering_df.head()

Unnamed: 0,gender,gpa,conscientiousness,agreeableness,extraversion,neuroticism,openness
0,f,2.9528,-0.159,0.3789,1.2396,0.1459,0.2889
1,m,2.6,1.1336,0.0459,1.2396,0.5262,-0.2859
2,m,2.4776,0.51,-0.1232,1.5428,-0.2902,-0.2875
3,m,3.216,-0.4463,0.2124,0.3174,0.2727,0.4805
4,f,2.572,-1.4992,-0.7473,-1.0697,0.06223,0.1864
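As a quick check of the rescaling, the sketch below applies the same formula, `x * 4.0 / 100.0`, to the first three `collegeGPA` values shown in the earlier preview. The names `percent_gpa` and `scaled_gpa` are used only for this illustration.

```python
import pandas as pd

# First three collegeGPA values from the preview (percent scale)
percent_gpa = pd.Series([73.82, 65.0, 61.94])

# Same linear map as in the cell above, applied to the whole Series at once
scaled_gpa = percent_gpa * 4.0 / 100.0
print(scaled_gpa.round(4).tolist())
```

Because the map is linear with a positive slope, it does not change Pearson correlations computed within this dataset. It only puts the values on the same nominal range as the second dataset, and it assumes the original values really are percentages.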


The AMCAT personality (Big Five) scores in this dataset are inconsistently scaled, and there is no canonical scale for AMCAT exams. However, according to the [AMCAT portal](https://www.myamcat.com/help/amcat-scores-and-results/scores), a 'good score' is often judged by percentile. We therefore convert the personality scores in each column to percentiles within that column.

In [26]:
# The five personality columns follow `gender` and `gpa`
big_five = engineering_df.columns[2:]

# Loop through all Big Five personality columns and map values to percentiles
for personality in big_five:
    engineering_df[personality] = engineering_df[personality].rank(pct=True)
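To illustrate what `rank(pct=True)` does in the loop above, here is a minimal sketch on a hypothetical four-value Series. With pandas' default `method='average'`, tied scores share the mean of their ranks, and each rank is then divided by the number of values.

```python
import pandas as pd

# Hypothetical scores; the two 1.1336 values are tied
scores = pd.Series([-0.159, 1.1336, 1.1336, 0.51])

# Ranks are 1, 3.5, 3.5, 2; dividing by 4 values gives the percentiles
percentiles = scores.rank(pct=True)
print(percentiles.tolist())
```

This is why rows with identical raw scores, such as the first two `extraversion` values in the preview, receive identical percentiles.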

## Part II: Summary of Engineering Graduate Cleaned Dataset, `engineering_df`

Written Description of `engineering_df` Column Labels:
- `gender`: Male `m` or female `f`
- `gpa`: Scaled college GPA ( out of 4.0 )
- `conscientiousness`: Big Five Conscientiousness percentile ( out of 1.0 )
- `agreeableness`: Big Five Agreeableness percentile ( out of 1.0 )
- `extraversion`: Big Five Extraversion percentile ( out of 1.0 )
- `neuroticism`: Big Five Neuroticism percentile ( out of 1.0 )
- `openness`: Big Five Openness percentile ( out of 1.0 )

In [27]:
engineering_df.head()

Unnamed: 0,gender,gpa,conscientiousness,agreeableness,extraversion,neuroticism,openness
0,f,2.9528,0.423115,0.572048,0.912608,0.622248,0.637759
1,m,2.6,0.877085,0.416111,0.912608,0.745163,0.401935
2,m,2.4776,0.676451,0.322382,0.957638,0.455971,0.365911
3,m,3.216,0.306371,0.486991,0.606071,0.667445,0.719646
4,f,2.572,0.088893,0.154436,0.125584,0.591561,0.5999


In [28]:
# Export cleaned dataset
if EXPORT_CSV:
    engineering_df.to_csv('data/cleaned_data/engineering_dataset.csv', index=False)
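`to_csv` raises an error if the target folder does not exist. A minimal sketch of creating the folder first with `pathlib` follows. A temporary directory stands in for the project root, and `toy_df` is a hypothetical one-row frame.

```python
import tempfile
from pathlib import Path

import pandas as pd

EXPORT_CSV = True
toy_df = pd.DataFrame({'gender': ['f'], 'gpa': [2.9528]})

# A temporary directory stands in for the project root in this sketch
with tempfile.TemporaryDirectory() as project_root:
    out_path = Path(project_root) / 'data' / 'cleaned_data' / 'engineering_dataset.csv'
    if EXPORT_CSV:
        # Create data/cleaned_data (and any missing parents) before writing
        out_path.parent.mkdir(parents=True, exist_ok=True)
        toy_df.to_csv(out_path, index=False)
    exported = out_path.exists()

print(exported)
```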

## Part III: Cleaning the Jordanian Medical Students Dataset, `medical_df`

Similarly, we clean and wrangle the Jordanian Medical Students dataset, starting by selecting our columns of interest.

In [29]:
# Output raw dataset columns
medical_df.columns

Index(['No.', 'Corresponding Year', 'Gender', 'Extraversion', 'Agreeableness',
       'Conscientiousness', 'Neuroticism', 'Openness', 'GPA', 'Unnamed: 9',
       'Key: Gender 1 = Male; Gender 2 = Female', 'Unnamed: 11', 'Unnamed: 12',
       'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15'],
      dtype='object')

In [30]:
# Save new dataframe with GPA and Big Five personality test scores labels
medical_df = medical_df[['Gender',
                         'GPA', 
                         'Conscientiousness', 
                         'Agreeableness', 
                         'Extraversion',
                         'Neuroticism', 
                         'Openness']]

# Rename columns
medical_df = medical_df.set_axis(['gender',
                                  'gpa', 
                                  'conscientiousness', 
                                  'agreeableness', 
                                  'extraversion', 
                                  'neuroticism', 
                                  'openness'], axis=1)

medical_df.head()

Unnamed: 0,gender,gpa,conscientiousness,agreeableness,extraversion,neuroticism,openness
0,2,3.84,0.556,0.644,0.725,0.95,0.84
1,2,2.9,0.689,0.867,0.4,0.625,0.52
2,2,2.57,0.711,0.756,0.7,0.6,0.76
3,2,2.58,0.644,0.556,0.575,0.55,0.54
4,2,3.65,0.8,0.911,0.575,0.625,0.78


Drop rows where any column has missing data (`NaN`).

In [31]:
raw_size = medical_df.shape
medical_df = medical_df.dropna().reset_index(drop=True)

print(f'Original size: {raw_size}')
print(f'Filtered size: {medical_df.shape}')

Original size: (307, 7)
Filtered size: (307, 7)


For consistency, we need to standardize the `gender` column to either `m` or `f`, matching the `engineering_df` dataframe. Fortunately, the raw `Data for Repository Personality and GPA Research.csv` file includes a key for the `gender` codes:

*Key: Gender 1 = Male; Gender 2 = Female*

In [32]:
medical_df['gender'] = medical_df['gender'].replace(1, 'm')
medical_df['gender'] = medical_df['gender'].replace(2,'f')
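An alternative to the two `replace` calls is a single dictionary lookup with `Series.map`, which also makes unexpected codes easy to spot because they become `NaN`. This is a sketch on a hypothetical Series; the code `3` is invented to show the check and does not come from the dataset.

```python
import pandas as pd

# Hypothetical gender codes; 3 is deliberately not in the key
gender_codes = pd.Series([2, 1, 2, 3])

# Key from the raw file: Gender 1 = Male; Gender 2 = Female
gender_key = {1: 'm', 2: 'f'}

# Codes missing from the key map to NaN instead of passing through silently
gender_labels = gender_codes.map(gender_key)
unmapped = gender_codes[gender_labels.isna()]

print(gender_labels.tolist())
print(unmapped.tolist())
```

With `replace`, an unexpected code would stay in the column unchanged, so the `map` variant is a slightly stricter check.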

This dataset already appears to use a 4.0 GPA scale. Thus, the next step is to convert the Big Five personality column values into percentiles, for consistency with `engineering_df`.

In [33]:
# Loop through all Big Five personality columns and map values to percentiles
# (`big_five` was defined in Part I; both dataframes share these column names)
for personality in big_five:
    medical_df[personality] = medical_df[personality].rank(pct=True)

## Part IV: Summary of Jordanian Medical Students Cleaned Dataset, `medical_df`

Written Description of `medical_df` Column Labels:
- `gender`: Male `m` or female `f`
- `gpa`: College GPA ( out of 4.0 )
- `conscientiousness`: Big Five Conscientiousness percentile ( out of 1.0 )
- `agreeableness`: Big Five Agreeableness percentile ( out of 1.0 )
- `extraversion`: Big Five Extraversion percentile ( out of 1.0 )
- `neuroticism`: Big Five Neuroticism percentile ( out of 1.0 )
- `openness`: Big Five Openness percentile ( out of 1.0 )

In [34]:
medical_df.head()

Unnamed: 0,gender,gpa,conscientiousness,agreeableness,extraversion,neuroticism,openness
0,f,3.84,0.136808,0.153094,0.806189,0.980456,0.887622
1,f,2.9,0.452769,0.838762,0.035831,0.543974,0.057003
2,f,2.57,0.537459,0.452769,0.726384,0.472313,0.677524
3,f,2.58,0.333876,0.043974,0.324104,0.350163,0.068404
4,f,3.65,0.79316,0.941368,0.324104,0.543974,0.7443


In [35]:
# Export cleaned dataset
if EXPORT_CSV:
    medical_df.to_csv('data/cleaned_data/medical_dataset.csv', index=False)
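Since both cleaned dataframes now share the same column names, one possible way to analyze them together is to stack them with `pd.concat` and tag each row with its origin. This is only a sketch: the two-row frames and the `source` column are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical two-row excerpts shaped like the cleaned dataframes
engineering_toy = pd.DataFrame({'gender': ['f', 'm'],
                                'gpa': [2.9528, 2.6],
                                'openness': [0.64, 0.40]})
medical_toy = pd.DataFrame({'gender': ['f', 'f'],
                            'gpa': [3.84, 2.9],
                            'openness': [0.89, 0.06]})

# Tag each row with its dataset of origin, then stack the frames
combined_df = pd.concat(
    [engineering_toy.assign(source='engineering'),
     medical_toy.assign(source='medical')],
    ignore_index=True,
)
print(combined_df)
```

Keeping a `source` column matters because the percentiles were computed within each dataset, so a given percentile is relative to that dataset's own sample.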