# Analyzing COVID-19's Effect on Student Engagement for Online Schooling
### Joshua Hess

**TODO:** Draft an introduction talking about how COVID affected online education. Introduce the project, and talk about what will be explored.

Now, let's load in a few of our datasets and take an initial look at them:

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import districts and products info
districts = pd.read_csv('/kaggle/input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
products = pd.read_csv('/kaggle/input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')

# Print out basic information
print("Districts Dataset:\n")
print(districts.head(5))
print(districts.info())

print("\nProducts Dataset:\n")
print(products.head(5))
print(products.info())

Districts Dataset:

   district_id     state  locale pct_black/hispanic pct_free/reduced  \
0         8815  Illinois  Suburb           [0, 0.2[         [0, 0.2[   
1         2685       NaN     NaN                NaN              NaN   
2         4921      Utah  Suburb           [0, 0.2[       [0.2, 0.4[   
3         3188       NaN     NaN                NaN              NaN   
4         2238       NaN     NaN                NaN              NaN   

  county_connections_ratio    pp_total_raw  
0                [0.18, 1[  [14000, 16000[  
1                      NaN             NaN  
2                [0.18, 1[    [6000, 8000[  
3                      NaN             NaN  
4                      NaN             NaN  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233 entries, 0 to 232
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   district_id               233 non-null    int64 
 1   state 
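
The demographic columns printed above hold range labels such as `[0.2, 0.4[` rather than numbers. In that notation the value is at least the first bound and strictly below the second. The following is a minimal sketch of one way to split such labels into numeric bounds, run on a toy frame that mimics the first rows of the output. The helper `parse_interval` and the column names `free_reduced_low` and `free_reduced_high` are hypothetical and not part of the dataset.

```python
import pandas as pd


def parse_interval(label):
    """Split a label such as '[0.2, 0.4[' into (lower, upper) floats; missing stays NaN."""
    if pd.isna(label):
        return (float("nan"), float("nan"))
    lower, upper = label.strip("[]").split(",")
    return (float(lower), float(upper))


# Toy frame mimicking the first three rows printed above
sample = pd.DataFrame({
    "district_id": [8815, 2685, 4921],
    "pct_free/reduced": ["[0, 0.2[", None, "[0.2, 0.4["],
})

# Parse every label, then store the bounds in two numeric columns
bounds = [parse_interval(value) for value in sample["pct_free/reduced"]]
sample["free_reduced_low"] = [b[0] for b in bounds]
sample["free_reduced_high"] = [b[1] for b in bounds]
print(sample)
```

Numeric bounds like these make it possible to sort or bin districts, which the raw string labels do not support directly.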

Before we can begin a proper analysis, we need to consolidate our data. The `engagement_data` folder contains one `.csv` file per district, and each file is named after its district's ID. Let's combine all of these CSVs into a single DataFrame. As we read each file, we'll add a `district_id` column that records which district each observation belongs to:

In [2]:
# Import libraries for file management
import os
import glob

# Collect the paths of all engagement_data CSVs in a list
filepath = "/kaggle/input/learnplatform-covid19-impact-on-digital-learning/engagement_data"
files = glob.glob(filepath + "/*.csv")

# Initialize empty list to hold all engagement_data DataFrames
data = []

# Read each CSV in 'files', tag its rows with a 'district_id' column, and add it to the 'data' list
for file in files:
    df = pd.read_csv(file)  # Read in DataFrame
    filename = os.path.splitext(file)  # Split the path into a (root, extension) tuple
    df['district_id'] = os.path.basename(filename[0])  # District ID is the filename without its extension
    data.append(df)  # Add the tagged DataFrame to the list

# Concatenate all DataFrames into one 'engagement_data' DataFrame
engagement_data = pd.concat(data)
print(engagement_data.head(10))
print(engagement_data.info())

         time    lp_id  pct_access  engagement_index district_id
0  2020-01-01  92844.0        0.01              0.68        6345
1  2020-01-01  64838.0        0.01              0.68        6345
2  2020-01-01  94058.0        0.00               NaN        6345
3  2020-01-01  26488.0        0.03             26.21        6345
4  2020-01-01  32340.0        0.01              0.11        6345
5  2020-01-01  95731.0        0.20             40.96        6345
6  2020-01-01  92918.0        0.01              4.54        6345
7  2020-01-01  17307.0        0.00               NaN        6345
8  2020-01-01  96255.0        0.01              0.11        6345
9  2020-01-01  83862.0        0.01              0.11        6345
<class 'pandas.core.frame.DataFrame'>
Index: 22324190 entries, 0 to 41427
Data columns (total 5 columns):
 #   Column            Dtype  
---  ------            -----  
 0   time              object 
 1   lp_id             float64
 2   pct_access        float64
 3   engagement_index  f
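
Two details of the output above are worth noting. The index runs "0 to 41427" across more than 22 million entries, so row labels repeat from file to file. The `district_id` values also come from filenames, so they are strings, while `districts['district_id']` is `int64`. The sketch below shows one possible way to address both, using small in-memory stand-ins for the CSVs. The paths and the `frames` dictionary are hypothetical.

```python
import os

import pandas as pd

# Hypothetical stand-ins for two per-district CSVs that were already read in
frames = {
    "/tmp/engagement_data/6345.csv": pd.DataFrame({"lp_id": [92844.0, 64838.0]}),
    "/tmp/engagement_data/1000.csv": pd.DataFrame({"lp_id": [26488.0]}),
}

parts = []
for path, df in frames.items():
    stem = os.path.splitext(os.path.basename(path))[0]  # '6345.csv' -> '6345'
    parts.append(df.assign(district_id=int(stem)))  # integer key, like the districts table

# ignore_index=True renumbers rows 0..n-1 instead of repeating each file's own labels
combined = pd.concat(parts, ignore_index=True)
print(combined.dtypes)
print(combined.index)
```

Whether to cast the ID at load time or just before a later merge is a design choice. Casting early keeps the key type consistent across tables from the start.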

Now we must clean up our data. Let's begin with our `engagement_data` table. First, let's check how many missing values are in each column:

In [3]:
# Print sum of all missing values
print(engagement_data.isnull().sum())

# Check whether every observation with a missing 'pct_access' also has a missing 'engagement_index'
missing_pct = engagement_data[engagement_data['pct_access'].isna()]
print("\nNumber of observations with missing pct_access: " + str(len(missing_pct.index)))

num_of_null_indexes = str(len(missing_pct[missing_pct['engagement_index'].isna()]))
print("Number of missing values for engagement_index: " + num_of_null_indexes)

time                      0
lp_id                   541
pct_access            13447
engagement_index    5378409
district_id               0
dtype: int64

Number of observations with missing pct_access: 13447
Number of missing values for engagement_index: 13447
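
The same overlap check can also be expressed as a two-way table of missingness flags. This is a hedged sketch on a toy frame rather than the full dataset, and the values in `toy` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: made-up values that include each missingness pattern of interest
toy = pd.DataFrame({
    "pct_access": [0.01, np.nan, 0.00, np.nan],
    "engagement_index": [0.68, np.nan, np.nan, np.nan],
})

# Rows: is 'pct_access' missing? Columns: is 'engagement_index' missing?
overlap = pd.crosstab(
    toy["pct_access"].isna(),
    toy["engagement_index"].isna(),
    rownames=["pct_access missing"],
    colnames=["engagement_index missing"],
)
print(overlap)
```

A zero in the cell for "`pct_access` missing, `engagement_index` present" confirms that one kind of missingness implies the other. The table also shows how many rows are missing only `engagement_index`.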


Notice that every observation with a missing `pct_access` is also missing its `engagement_index`. We can assume that if `pct_access` is missing, students likely didn't use any learning products that day, so there was no `engagement_index` to measure. For all of these observations, we'll replace both values with 0:

In [4]:
# Boolean mask for rows where both 'pct_access' and 'engagement_index' are missing
both_missing = (engagement_data['pct_access'].isna()) & (engagement_data['engagement_index'].isna())

# Replace NAs in both columns with 0
engagement_data.loc[both_missing, ['pct_access', 'engagement_index']] = 0

# Check missing values
print(engagement_data.isna().sum())

time                      0
lp_id                   541
pct_access                0
engagement_index    5364962
district_id               0
dtype: int64
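
Because the zeros written here rest on an assumption, it can help to record which rows were imputed before overwriting them. The sketch below illustrates the idea on a toy frame. The column name `was_imputed` is hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: row 1 is missing both values, row 2 is missing only 'engagement_index'
toy = pd.DataFrame({
    "pct_access": [0.01, np.nan, 0.00],
    "engagement_index": [0.68, np.nan, np.nan],
})

both_missing = toy["pct_access"].isna() & toy["engagement_index"].isna()

# 'was_imputed' is a hypothetical flag column marking rows whose zeros are assumed
toy["was_imputed"] = both_missing
toy.loc[both_missing, ["pct_access", "engagement_index"]] = 0
print(toy)
```

With a flag like this, later summaries can be computed both with and without the imputed rows to see how sensitive they are to the assumption.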


Rows with a missing `lp_id` make up only about 0.0024% of the dataset (541 of 22,324,190 observations), so we'll drop them. For the time being, we'll fill all remaining NAs with zero.
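
As a quick check of the 0.0024% figure, the share can be recomputed from the counts printed in the earlier outputs. The variable names here are illustrative only.

```python
# Counts copied from the outputs printed earlier in this notebook
rows_total = 22_324_190     # entries reported by engagement_data.info()
rows_missing_lp_id = 541    # missing 'lp_id' values reported by isnull().sum()

# Share of rows without an lp_id, expressed as a percentage
share_pct = rows_missing_lp_id / rows_total * 100
print(f"{share_pct:.4f}% of rows have no lp_id")
```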

In [5]:
# Drop rows where 'lp_id' is missing
engagement_data = engagement_data.dropna(subset=['lp_id'])

# Fill all remaining NAs with 0
engagement_data = engagement_data.fillna(0)

# Check the DataFrame to verify missing values are taken care of
print(engagement_data.isna().sum())

time                0
lp_id               0
pct_access          0
engagement_index    0
district_id         0
dtype: int64
