## Khan Academy Math Student Dashboard: Data Cleaning
We will create a student dashboard that displays each student's progress through Khan Academy Math. The dashboard should answer the following questions:
- How is the student performing overall and this month?
- What is their assignment completion rate?
- Which topics is the student strongest and weakest in?
- What is the student's percentile in the class?
- Is the student growing?

We want to automate the dashboarding process for each class in Lyceum Village, grades K-8. We will test on the grade K-1 students' Khan Academy Math data first, building the functions and a webpage that showcase their progress.

### Import the Data
The imported data is for grade K-1. The students' names have already been replaced with aliases.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("../Resources/K_1-Math-Anonymous.csv")

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Assignment Name              266 non-null    object 
 1   Student Name                 266 non-null    object 
 2   Score At Due Date            26 non-null     float64
 3   Score Best Ever              30 non-null     float64
 4   Points Possible              30 non-null     float64
 5   Number Of Attempts           266 non-null    int64  
 6   Most Recent Completion Date  30 non-null     object 
 7   Start Date                   266 non-null    object 
 8   Due Date                     266 non-null    object 
 9   Assignment URL               266 non-null    object 
 10  Assignment Type              266 non-null    object 
 11  Due Date (no time)           266 non-null    object 
dtypes: float64(3), int64(1), object(8)
memory usage: 25.1+ KB


Our CSV dataset has 12 variables.

Types of data:
- 3 float: 
    - Score At Due Date
    - Score Best Ever
    - Points Possible
- 1 integer:
    - Number Of Attempts
- 8 string:
    - Assignment Name
    - Student Name
    - Most Recent Completion Date
    - Start Date
    - Due Date
    - Assignment URL
    - Assignment Type
    - Due Date (no time)

So far, each column's data type matches what we expect.

There are 266 rows in total. The following columns have missing data:
- Score At Due Date: 240 missing
- Score Best Ever: 236 missing
- Points Possible: 236 missing
- Most Recent Completion Date: 236 missing

The data is missing because some students did not complete their assignments. We will remove these rows in the actual student analysis.
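The missing counts can also be read straight from the data frame rather than computed by hand from the `df.info()` output. The following is a minimal sketch, assuming a small made-up frame (`toy`) that reuses three of the real column names; the values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the real dataset.
toy = pd.DataFrame({
    "Student Name": ["A B", "C D", "E F", "G H"],
    "Score Best Ever": [8.0, np.nan, 5.0, np.nan],
    "Points Possible": [10.0, np.nan, 10.0, np.nan],
})

# Missing values per column (row count minus non-null count).
missing_counts = toy.isna().sum()

# Keep only the rows where the assignment was actually completed and scored.
completed = toy.dropna(subset=["Score Best Ever"])

print(missing_counts)
print(len(completed))
```

Running the same two calls on `df` should reproduce the counts listed above, and `dropna` with a `subset` is one possible way to drop the incomplete assignments before analysis.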

### Student Anonymity
We will need to create a student ID column to give each student anonymity when their scores are displayed to other parents.
Their IDs will be formatted as follows: grade_studentinitials_number

Example: G2_KL_1

#### Create Student ID
We will automate this step so that the IDs are generated only once. The application will request the grade level from the user and store the student ID information in a PostgreSQL database.
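One way to keep ID creation a one-time step is to make the insert idempotent, so that re-running the notebook never overwrites an ID that has already been assigned. The sketch below is only an illustration: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for PostgreSQL, and the table name `student_ids`, its columns, and the helper `store_ids` are all hypothetical. PostgreSQL accepts the same `ON CONFLICT ... DO NOTHING` clause, but the driver and placeholder syntax would differ.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student_ids ("
    "student_name TEXT PRIMARY KEY, "
    "student_id TEXT NOT NULL UNIQUE)"
)

def store_ids(conn, id_map):
    """Insert name -> ID pairs, leaving any existing rows untouched."""
    conn.executemany(
        "INSERT INTO student_ids (student_name, student_id) VALUES (?, ?) "
        "ON CONFLICT (student_name) DO NOTHING",
        list(id_map.items()),
    )
    conn.commit()

# First run assigns an ID; the second run must not change it.
store_ids(conn, {"Ned Ethans": "GK1_NE_1"})
store_ids(conn, {"Ned Ethans": "GK1_NE_9", "Anna Kite": "GK1_AK_1"})

rows = conn.execute(
    "SELECT student_name, student_id FROM student_ids ORDER BY student_name"
).fetchall()
print(rows)
```

Keying the table on the student name means a student who already has an ID keeps it, while new students are added on later runs.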

In [8]:
df['Student Name'].unique()

array(['Ned Ethans', 'Anna Kite', 'April Luna', 'Andy Hong',
       'Abby Nguyen', 'Everett Chase', 'Kyle Anderson', 'Kevin Martelle',
       'Misa Bing', 'Olsen Le', 'Terry Long'], dtype=object)

In [15]:
# User input grade level
stu_grade = input("Enter the student grade level: ")

In [18]:
# Loop through the list of unique students and create their ID
stu_id_list = []
for name in df['Student Name'].unique():
    stu_name = name.split(" ")
    # Grade prefix plus first and last initials, e.g. "GK1_NE"
    stu_id = "G" + stu_grade + "_" + stu_name[0][0] + stu_name[1][0]
    stu_id_list.append(stu_id)
stu_id_list[:5]

['GK1_NE', 'GK1_AK', 'GK1_AL', 'GK1_AH', 'GK1_AN']

In [20]:
# Add in the numeric value to account for students with the same initials
stu_id_list2 = []
if len(set(stu_id_list)) == len(stu_id_list):
    # No two students share initials, so every ID gets the suffix 1
    for sid in stu_id_list:
        stu_id_list2.append(sid + "1")
else:
    # TODO: number students who share initials 1, 2, 3, ...
    # The current list of students has no repeated initials, so this branch
    # still needs a dummy list to be tested on.
    pass
stu_id_list2[:5]

['GK1_NE1', 'GK1_AK1', 'GK1_AL1', 'GK1_AH1', 'GK1_AN1']
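The case of repeated initials is not handled yet. The sketch below shows one possible approach: count how many students have been seen with each pair of initials and use that count as the numeric suffix. The function name `make_student_ids` and the example names are hypothetical, and the IDs follow the grade_studentinitials_number format described earlier, with an underscore before the number as in G2_KL_1.

```python
from collections import defaultdict

def make_student_ids(names, grade):
    """Return {name: ID}, numbering students who share initials in input order."""
    seen = defaultdict(int)  # initials -> number of students seen so far
    ids = {}
    for name in names:
        if name in ids:
            continue  # skip repeated rows for the same student
        parts = name.split()
        initials = parts[0][0] + parts[-1][0]
        seen[initials] += 1
        ids[name] = f"G{grade}_{initials}_{seen[initials]}"
    return ids

# Hypothetical names, two of which share the initials "AL".
example_ids = make_student_ids(["April Luna", "Andy Lee", "Ned Ethans"], "K1")
print(example_ids)
```

Because the function skips names it has already seen, it could be called on `df['Student Name']` directly. The numbering depends on the order of the input, which is one more reason to generate the IDs once and store them.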