## Student Data Exploration

This dataframe contains 35,230 entries.
It includes:
 - Student id (student_id)
 - Student country (student_country)
 - Date of registration (date_registered)

#### Questions

 - Which are the most and least represented countries?
     - This could be compared against the previous year's data to understand trends and focus resources.
     - What is the relationship between a country's representation here and its students having sought help in the hub?
 - How are the registration dates distributed?
     - Do registration dates cluster differently for different countries? (Check against ad campaigns and offers to understand their effectiveness.)
 

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

In [73]:
student_info = pd.read_csv('dataSets/365_database/365_student_info.csv')

student_info.head()

student_info.info()

student_info.shape


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35230 entries, 0 to 35229
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   student_id       35230 non-null  int64 
 1   student_country  35217 non-null  object
 2   date_registered  35230 non-null  object
dtypes: int64(1), object(2)
memory usage: 825.8+ KB


(35230, 3)

In this dataframe, only the country column has missing values: 13 of the 35,230 rows (35230 - 35217) have no country.
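
The missing-value count can also be computed directly instead of being read off the info() output. A minimal sketch on a small made-up frame: the column names match the dataset, but the three rows are invented for illustration.

```python
import pandas as pd

# Toy stand-in for student_info; these values are invented, not from the dataset.
toy = pd.DataFrame({
    'student_id': [1, 2, 3],
    'student_country': ['IN', None, 'US'],
    'date_registered': ['2022-01-01', '2022-01-02', '2022-01-03'],
})

# isna() marks each missing cell as True; sum() then counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

On the real data, student_info.isna().sum() should give the same picture as the info() output above: a non-zero count for student_country only.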

In [53]:
student_info_cleaned = student_info.dropna()


Rows with a missing country value are removed and the result is stored in student_info_cleaned; the original dataframe is left unchanged.
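
Called without arguments, dropna() removes a row if any column is missing. Since only the country matters here, restricting the check with the subset parameter makes the intent explicit. The sketch below contrasts the two on invented toy data; in the real dataframe date_registered has no gaps, so both forms should give the same result.

```python
import pandas as pd

# Toy data (invented): one row lacks a country, another lacks a date.
toy = pd.DataFrame({
    'student_id': [1, 2, 3, 4],
    'student_country': ['IN', None, 'US', 'EG'],
    'date_registered': ['2022-01-01', '2022-01-02', None, '2022-01-04'],
})

# Drops rows with a missing value in ANY column.
drop_any = toy.dropna()

# Drops only rows where the country is missing.
drop_country_only = toy.dropna(subset=['student_country'])

# The original frame is untouched in both cases.
print(len(toy), len(drop_any), len(drop_country_only))
```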

In [120]:
country_frequency = student_info_cleaned['student_country'].value_counts()
country_frequency

IN    6933
US    4768
EG    3003
GB    1748
NG    1718
      ... 
SR       1
RE       1
BI       1
GY       1
BM       1
Name: student_country, Length: 180, dtype: int64

The value_counts() method returns a Series of the unique values and how often each occurs, sorted from most to least frequent.

Applying it to the country column, <mark>student_info_cleaned['student_country'].value_counts()</mark>, returns a Series of country codes and how often each appears in the dataframe.

This Series is stored in the variable country_frequency, which is then used to create a dataframe.
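
To make the behaviour of value_counts() concrete, here is a small sketch on an invented list of country codes. It also shows two optional parameters that are not used in this notebook: normalize, which returns shares instead of counts, and dropna, which controls whether missing values are counted.

```python
import pandas as pd

# Invented sample; None stands in for a missing country.
countries = pd.Series(['IN', 'US', 'IN', 'EG', 'IN', 'US', None])

# Counts per unique value, most frequent first; missing values are excluded by default.
counts = countries.value_counts()

# Same ordering, but as fractions of the non-missing entries.
shares = countries.value_counts(normalize=True)

# Include the missing values as their own category.
with_missing = countries.value_counts(dropna=False)

print(counts)
print(shares)
print(with_missing)
```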

In [124]:
student_by_countries = pd.DataFrame(country_frequency).reset_index().rename(columns={'index':'Country','student_country':'No. of students'})

student_by_countries


|     | Country | No. of students |
| --- | ------- | --------------- |
| 0   | IN      | 6933            |
| 1   | US      | 4768            |
| 2   | EG      | 3003            |
| 3   | GB      | 1748            |
| 4   | NG      | 1718            |
| ... | ...     | ...             |
| 175 | SR      | 1               |
| 176 | RE      | 1               |
| 177 | BI      | 1               |
| 178 | GY      | 1               |
| 179 | BM      | 1               |


The above cell:
 - Creates a dataframe from the series in the preceding cell.
 - Resets the index: the previous index (the country codes) becomes a regular column and is replaced by a default integer index.
 - Renames the columns so they are easier to read.
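
One caveat about the rename step: the mapping relies on the column names that reset_index() produces, and these depend on the pandas version. From pandas 2.0 onwards, value_counts() names its result 'count' and moves the original column name to the index, so the keys 'index' and 'student_country' would no longer match the value column. The sketch below is an alternative that sets the names explicitly and so does not depend on the defaults. It uses invented sample data, and the column label is just an illustrative choice.

```python
import pandas as pd

# Invented sample standing in for the cleaned country column.
countries = pd.Series(['IN', 'US', 'IN', 'EG', 'IN', 'US'], name='student_country')
country_frequency = countries.value_counts()

student_by_countries = (
    country_frequency
    .rename_axis('Country')                # name the index, whatever it was called before
    .reset_index(name='No. of students')   # index -> column; counts get an explicit name
)
print(student_by_countries)
```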

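The dataframe lists countries in descending order, from most to least represented. It shows where most of our student population comes from: India (IN) leads with 6,933 students, followed by the US with 4,768 and Egypt (EG) with 3,003. At the other end, several countries, such as SR, RE, BI, GY and BM, have a single student each.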
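
Because several countries tie for the lowest count, reading off only the last row would understate the answer to "least represented". A minimal sketch of pulling out every country at the maximum and minimum, using a five-row excerpt of the counts shown above (the column label is an assumption and should match whatever the dataframe actually uses):

```python
import pandas as pd

# Five rows excerpted from the output above.
student_by_countries = pd.DataFrame({
    'Country': ['IN', 'US', 'EG', 'GY', 'BM'],
    'No. of students': [6933, 4768, 3003, 1, 1],
})

col = 'No. of students'

# Keep every row that ties for the highest or lowest count.
most = student_by_countries[student_by_countries[col] == student_by_countries[col].max()]
least = student_by_countries[student_by_countries[col] == student_by_countries[col].min()]

print(most['Country'].tolist())
print(least['Country'].tolist())
```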