# **Predict Student Dropout**

#Introduction

Quality education is a fundamental right that every government and educational institution strives to provide. However, dropout rates remain a persistent challenge, influenced by social, economic, and demographic factors, so a comprehensive analysis of dropout patterns is needed. By identifying the underlying causes and vulnerable groups, we aim to support targeted interventions that reduce dropout rates and help every student complete their education. A predictive model built on these findings can help governments and educational institutions identify at-risk students early.

The purpose of this notebook is to provide an in-depth analysis of student dropout trends in school education. The analysis is based on a dataset called "Predict Students' Dropout and Academic Success - Investigating the Impact of Social and Economic Factors," which was sourced from Kaggle and contributed by thedevastator (https://www.kaggle.com/thedevastator). The dataset contains a variety of attributes that help to explain the factors behind student dropout.

#Overview

Our main goal for this project is to analyze student dropout rates using the dataset "Predict Students' Dropout and Academic Success - Investigating the Impact of Social and Economic Factors." Even though the dataset may not contain information on schools, areas, or castes, we can still extract valuable insights from the existing attributes.

The aim of this analysis is to provide insights into the factors influencing student dropout rates. The following key aspects will be explored:

1. Demographic Analysis: We will investigate how factors like gender, age at enrollment, marital status, and nationality are correlated with student dropout rates.

2. Economic Factors: The influence of economic factors such as parental occupation, tuition fee payment status, and scholarship eligibility on student dropout rates will be investigated.

3. Academic Performance: We will analyze how students' academic performance, represented by variables like curricular units and evaluations, impacts their likelihood of dropping out.

4. Social and Special Needs: We will explore whether students with educational special needs or those facing unique challenges like displacement or debt are more susceptible to dropout.

5. Macro-economic Factors: We will investigate how broader economic indicators like unemployment rate, inflation rate, and GDP growth relate to dropout rates, as these can indirectly affect education outcomes.

By identifying high-risk groups and understanding the nuanced factors contributing to dropout rates, the government can develop targeted interventions and policies to improve student retention and foster a conducive learning environment.

In the subsequent sections of this notebook, we will delve into data preprocessing, exploratory data analysis, and the development of predictive models to aid in the dropout analysis. While we may not have school-wise, area-wise, or caste-wise information, we will use the available attributes to contribute to the government's efforts in ensuring every child's right to education and reducing dropout rates where possible.

#About the DataSet

This dataset provides a comprehensive view of students enrolled in various undergraduate degrees offered at a higher education institution. It includes demographic data, social-economic factors, and academic performance information that can be used to analyze the possible predictors of student dropout and academic success. This dataset contains multiple disjoint databases consisting of relevant information available at the time of enrollment, such as application mode, marital status, course chosen, and more. Additionally, this data can be used to estimate overall student performance at the end of each semester by assessing credited/enrolled/evaluated/approved curricular units and their respective grades. Finally, we have unemployment rate, inflation rate, and GDP from the region, which can help us further understand how economic factors play into student dropout rates or academic success outcomes. This powerful analysis tool will provide valuable insight into what motivates students to stay in school or abandon their studies in a wide range of disciplines such as agronomy, design, education, nursing, journalism, management, social service, or technologies.

#Columns

1. Marital status -	The marital status of the student. (Categorical)
2. Application mode -	The method of application used by the student. (Categorical)
3. Application order - The order in which the student applied. (Numerical)
4. Course -	The course taken by the student. (Categorical)
5. Daytime/evening attendance -	Whether the student attends classes during the day or in the evening. (Categorical)
6. Previous qualification -	The qualification obtained by the student before enrolling in higher education. (Categorical)
7. Nationality -	The nationality of the student. (Categorical)
8. Mother's qualification -	The qualification of the student's mother. (Categorical)
9. Father's qualification -	The qualification of the student's father. (Categorical)
10. Mother's occupation - The occupation of the student's mother. (Categorical)
11. Father's occupation - The occupation of the student's father. (Categorical)
12. Displaced -	Whether the student is a displaced person. (Categorical)
13. Educational special needs -	Whether the student has any special educational needs. (Categorical)
14. Debtor -	Whether the student is a debtor. (Categorical)
15. Tuition fees up to date -	Whether the student's tuition fees are up to date. (Categorical)
16. Gender -	The gender of the student. (Categorical)
17. Scholarship holder -	Whether the student is a scholarship holder. (Categorical)
18. Age at enrollment -	The age of the student at the time of enrollment. (Numerical)
19. International -	Whether the student is an international student. (Categorical)
20. Curricular units 1st sem (credited) - The number of curricular units credited to the student in the first semester. (Numerical)
21. Curricular units 1st sem (enrolled) - The number of curricular units the student enrolled in during the first semester. (Numerical)
22. Curricular units 1st sem (evaluations) - The number of curricular unit evaluations the student took in the first semester. (Numerical)
23. Curricular units 1st sem (approved) - The number of curricular units the student passed in the first semester. (Numerical)

#Problem statement and Objective

The task is to develop a machine learning model that can accurately classify students as potential dropouts based on their demographic, socio-economic, and academic attributes.

#Solution and Analysis

# Prepare the tools

Import the required tools

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from IPython.display import Markdown
import joblib

Setup plotly


This function is required for plotly to run properly in Colab; without it, the plots sometimes do not render.

To avoid repeating this setup code, it is wrapped in the `enable_plotly_in_cell` function.


In [None]:
def enable_plotly_in_cell():
  # Load require.js so that plotly figures render inside a Colab output cell
  from IPython.display import display, HTML
  from plotly.offline import init_notebook_mode
  display(HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
  init_notebook_mode(connected=False)

# Load data

Read dataset from csv file

In [None]:
dataset =  pd.read_csv("student dataset.csv")

# Understanding the data

Descriptive Analysis

In [None]:
dataset.shape

In [None]:
print(f'We see the dataset has {dataset.shape[0]} observations and {dataset.shape[1]} features.')

Review the data and sample data

In [None]:
dataset.head()

In [None]:
dataset.sample(5)

# Check datatype and information regarding all different features

We want to look at the datatypes and check to see if they were interpreted correctly.

In [None]:
dataset.info()

The results show that every column has a numerical datatype except the Target column.
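The datatype check can also be done programmatically instead of reading the `info()` output by eye. The sketch below is only an illustration: `non_numeric_columns` is a hypothetical helper, and the small `toy` frame stands in for the real dataset.

```python
import pandas as pd

def non_numeric_columns(df: pd.DataFrame) -> list:
    """Return the names of all columns whose dtype is not numeric."""
    return df.select_dtypes(exclude="number").columns.tolist()

# Toy frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    "Age at enrollment": [19, 23],
    "Gender": [1, 0],
    "Target": ["Dropout", "Graduate"],
})
print(non_numeric_columns(toy))
```

Calling `non_numeric_columns(dataset)` on the notebook's data should list only the Target column if the observation above holds.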

Summary statistics of the data

In [None]:
dataset.describe(include = 'all')

# Data Pre-processing

The review above did not reveal any obvious problems. However, we explicitly check for missing values and duplicated rows and handle them if required.

Look for missing data

In [None]:
dataset.isnull().sum()

We notice there are no missing values at this stage

Check for duplicated records

In [None]:
print(f'There are {dataset.duplicated().sum()} duplicated rows in the dataset.')
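If the check ever reports duplicated rows, one possible way to handle them is sketched below. The helper name `drop_duplicate_rows` and the toy data are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

def drop_duplicate_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return df without fully duplicated rows, keeping the first occurrence."""
    n_duplicates = int(df.duplicated().sum())
    if n_duplicates == 0:
        return df  # nothing to clean
    # Drop the repeats and rebuild a contiguous index
    return df.drop_duplicates().reset_index(drop=True)

# Toy data: the second row repeats the first one
toy = pd.DataFrame({"a": [1, 1, 2], "b": [3, 3, 4]})
cleaned = drop_duplicate_rows(toy)
print(len(toy), "->", len(cleaned))
```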

Rename columns of dataset

This is done to make it easier to process the data. At the same time, it helps with the understanding and analysis of the data

In [None]:
dataset.rename(columns = {'Nacionality' : 'Nationality', 'Age at enrollment':'Age'}, inplace = True)

In [None]:
dataset.info()

We notice the columns have been updated correctly

# Exploratory Data Analysis

Most of the variables in the dataset are already encoded in numerical format. However, for our analysis, we will map some columns back to their original categorical labels.

Create a copy of the dataset for exploratory data analysis

In [None]:
visualization_dataset = dataset.copy()

Look at the number of unique values in each column of the dataset

This will also help when building the web application

In [None]:
for column in dataset.columns:
  # nunique() counts the distinct values in the column
  print(f"{column} - {dataset[column].nunique()}")

# Evaluate Target Variable

We review the values in the target variable

In [None]:
visualization_dataset['Target'].unique()

Based on the target column we can understand the following -

1. Enrolled - The student is currently enrolled in the program
2. Graduate - The student has graduated
3. Dropout - The student has dropped out

Get the counts for the Target Column

In [None]:
visualization_dataset['Target'].value_counts()

In [None]:
enable_plotly_in_cell()

target_counts = visualization_dataset['Target'].value_counts()

df = pd.DataFrame({
    'Target': target_counts.index,
    'Count_T' : target_counts.values
})

fig = px.pie(df,
             names ='Target',
             values ='Count_T',
             width=800,
             height=500,
            title='Number of dropouts, enrolled students and graduates in the Target column')

# Labels come from the `names` column, so they always match the counts
fig.update_traces(hole=0.4,textinfo='value+label', pull=[0,0.2,0.1])
fig.show()

In [None]:
enable_plotly_in_cell()

fig = px.histogram(visualization_dataset, x='Target', color = 'Target',
                   opacity = 1, barmode = 'overlay',
                    width = 800, height = 500,
                   color_discrete_sequence=px.colors.qualitative.G10)
fig.update_layout(title='Target Count')
fig.show()

Observation

1. The graph shows that graduates are the largest group of students
2. There are more dropouts than students who are currently enrolled
3. More students graduate than drop out. However, the number of dropouts is still high.
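To put numbers on these observations, the class shares can be computed directly from the Target column. This is a minimal sketch on made-up data; `class_shares` is a hypothetical helper and the toy counts are not the real class distribution.

```python
import pandas as pd

def class_shares(df: pd.DataFrame, target_col: str = "Target") -> pd.Series:
    """Percentage of rows that fall into each class of target_col."""
    return df[target_col].value_counts(normalize=True).mul(100).round(1)

# Toy target column (illustrative counts only)
toy = pd.DataFrame({"Target": ["Graduate"] * 5 + ["Dropout"] * 3 + ["Enrolled"] * 2})
shares = class_shares(toy)
print(shares)
```

On the notebook's data, `class_shares(visualization_dataset)` would give the percentage behind each slice of the pie chart.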

# Distribution of age of students at the time of enrollment

We review the values in the Age feature

In [None]:
visualization_dataset['Age'].unique()

Let's get the counts for the Age feature

In [None]:
visualization_dataset['Age'].value_counts()

Let's use some visuals to review how a student's age relates to the dropout ratio

In [None]:
enable_plotly_in_cell()

fig = px.histogram(visualization_dataset, x='Age', color = 'Target',
                   opacity = 1, barmode = 'overlay',
                    width = 800, height = 500,
                   color_discrete_sequence=px.colors.qualitative.G10)
fig.update_layout(title='Age distribution of students at time of enrollment')
fig.show()

In [None]:
# This cell uses seaborn/matplotlib, so no plotly setup is needed
plt.figure(figsize=(10, 6))
sns.boxplot(x='Target', y='Age', data=visualization_dataset)
plt.xlabel('Target')
plt.ylabel('Age')
plt.title('Relationship between Age and Target')
plt.show()

Observation

1. The distribution shows that a high percentage of students are in their late teens and early 20s
2. We can also notice that the share of dropouts increases from the mid 20s to the early 30s
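The second observation is about a rate, which is hard to read off overlaid histograms. One way to check it is to bin the ages and compute the dropout share per bin. The sketch below assumes the renamed `Age` column and the label `'Dropout'`; the bin edges, the helper name, and the toy values are arbitrary choices for illustration.

```python
import pandas as pd

def dropout_rate_by_age_band(df, age_col="Age", target_col="Target",
                             bins=(16, 20, 25, 30, 40, 100)):
    """Percentage of dropouts within each age band (right-closed intervals)."""
    bands = pd.cut(df[age_col], bins=list(bins))
    is_dropout = df[target_col].eq("Dropout")
    # Mean of a boolean series is the share of True values
    return is_dropout.groupby(bands, observed=True).mean().mul(100).round(1)

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Age": [18, 19, 19, 22, 24, 27, 28, 35],
    "Target": ["Graduate", "Graduate", "Dropout", "Graduate",
               "Dropout", "Dropout", "Dropout", "Graduate"],
})
rates = dropout_rate_by_age_band(toy)
print(rates)
```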

At this stage, since we are going to analyze several other features, we create helper functions to avoid repeating code.

The following two functions were created -

1. Get Data Dictionaries - this creates, for each category of a feature, a dictionary of Target value counts
2. Create Pie Charts - this creates one pie chart per dictionary

In [None]:
def get_data_dictionaries(category_list, dfcolumn_name, target_col, dictionary_list):
  # For each category, count the target values of the rows in that category
  for each_category in category_list:
    dictionary = dict(visualization_dataset[visualization_dataset[dfcolumn_name]== each_category][target_col].value_counts())
    dictionary_list.append(dictionary)
  return dictionary_list

In [None]:
def create_pie(dictionary_list, trace_list, colors_list):
  # Build one pie trace per dictionary (labels = target values, values = counts)
  for dictionary in dictionary_list:
    trace = go.Pie(values = list(dictionary.values()), labels = list(dictionary.keys()),
           textposition = 'inside', textinfo='percent+label',
           marker=dict(colors=colors_list))
    trace_list.append(trace)
  return trace_list
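The per-category dictionaries hold raw counts. As a complementary view, the same breakdown can be produced as a single table of row percentages with `pd.crosstab`. This is only a sketch: `target_share_by_category` is a hypothetical helper, and the toy frame mimics the Gender and Target columns with made-up values.

```python
import pandas as pd

def target_share_by_category(df, category_col, target_col="Target"):
    """Row-normalised crosstab: percentage of each target class per category."""
    table = pd.crosstab(df[category_col], df[target_col], normalize="index")
    return table.mul(100).round(1)

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Gender": ["Male"] * 4 + ["Female"] * 4,
    "Target": ["Dropout", "Dropout", "Graduate", "Enrolled",
               "Graduate", "Graduate", "Graduate", "Dropout"],
})
table = target_share_by_category(toy, "Gender")
print(table)
```

Each row of the table sums to 100, so categories of very different sizes can be compared directly.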

# Distribution of Gender at the time of enrollment

We review the values in the Gender feature

In [None]:
visualization_dataset['Gender'].unique()

From the dataset we know the following:

*   1 : Male
*   0 : Female

Let's get the counts for the Gender Feature

In [None]:
visualization_dataset['Gender'].value_counts()

For visualization, we will convert the numerical values back to the categorical values

In [None]:
visualization_dataset['Gender'] = visualization_dataset['Gender'].map({1:'Male', 0:'Female'})

Getting dictionaries for genders

In [None]:
genders = visualization_dataset['Gender'].unique()
genders_dictionaries = get_data_dictionaries(genders, 'Gender', 'Target', [])
genders_dictionaries

Let's use some visuals to review how a student's gender relates to the dropout ratio

In [None]:
enable_plotly_in_cell()

fig = px.histogram(visualization_dataset, x='Gender',
                   opacity = 1, barmode = 'overlay',
                   width = 800, height = 500,
                   color_discrete_sequence=px.colors.qualitative.G10)

fig.update_layout(title='Gender Count Distribution')
fig.show()

In [None]:
enable_plotly_in_cell()

gender_counts = visualization_dataset['Gender'].value_counts()

fig = px.pie(values = gender_counts.values,
            names = gender_counts.index,
            title='Gender percentage split',
            width = 800, height = 500,
            color_discrete_sequence=["red", "green", "blue", "goldenrod", "magenta"]
            )

# Labels come from the value_counts index, so each slice is named correctly
fig.update_traces(hole=0.4,textinfo='percent+label', pull=[0,0.2])
fig.show()

Creating subplots for Gender distribution of students based on Target column

In [None]:
enable_plotly_in_cell()

# Titles are built from `genders` so they always match the order of the pies
fig = make_subplots(rows=1, cols=2, subplot_titles=[f'{gender} Student Distribution' for gender in genders],
                    specs=[[{'type': 'pie'},{'type': 'pie'}]])

traces = []
create_pie(genders_dictionaries, traces, ['midnightblue', 'tomato', 'lightseagreen'])

fig.add_trace(traces[0], row=1, col=1)
fig.add_trace(traces[1], row=1, col=2)

fig.update_layout(height=500, width=800,
                  title='Gender Distribution with Target',
                  showlegend = False,
                  font=dict(size=14))

fig.show()


Observation

1. There are more female students than male students
2. A higher share of male students drop out compared to female students

#Distribution of Attendance

The attendance feature tells us whether the student attends classes during the day or in the evening

We review the values in the Daytime/evening attendance feature

In [None]:
visualization_dataset['Daytime/evening attendance'].unique()

From the dataset we know the following:

*   1: Daytime attendance
*   0: Nighttime attendance

Let's get the counts for the Daytime/evening attendance feature

In [None]:
visualization_dataset['Daytime/evening attendance'].value_counts()

For visualization, we convert the numerical values to categorical values

In [None]:
visualization_dataset['Daytime/evening attendance'] = visualization_dataset['Daytime/evening attendance'].map({1:'Daytime attendance', 0: 'Nighttime attendance'})

Getting dictionaries for Attendance

In [None]:
attendance = visualization_dataset['Daytime/evening attendance'].unique()
attendance_dictionaries = get_data_dictionaries(attendance, 'Daytime/evening attendance', 'Target', [])
attendance_dictionaries

Let's use some visuals to review how daytime or evening attendance relates to the dropout ratio

In [None]:
enable_plotly_in_cell()

fig = px.histogram(visualization_dataset, x='Daytime/evening attendance',
                   opacity = 1, barmode = 'overlay',
                   width = 800, height = 500,
                   color_discrete_sequence=px.colors.qualitative.G10)

fig.update_layout(title='Attendance Count Distribution')
fig.show()

In [None]:
enable_plotly_in_cell()

attendance_counts = visualization_dataset['Daytime/evening attendance'].value_counts()

fig = px.pie(values = attendance_counts.values,
            names = attendance_counts.index,
            title='Attendance percentage split',
            width = 800, height = 500,
            color_discrete_sequence=["red", "green", "blue", "goldenrod", "magenta"]
            )

# Labels come from the value_counts index, so each slice is named correctly
fig.update_traces(hole=0.4,textinfo='percent+label', pull=[0,0.2])
fig.show()

Creating subplots for Attendance distribution of students based on Target column

In [None]:
enable_plotly_in_cell()

# Titles are built from `attendance` so they always match the order of the pies
fig = make_subplots(rows=1, cols=2, subplot_titles=list(attendance),
                    specs=[[{'type': 'pie'},{'type': 'pie'}]])

traces = []
create_pie(attendance_dictionaries, traces, ['midnightblue', 'tomato', 'lightseagreen'])

fig.add_trace(traces[0], row=1, col=1)
fig.add_trace(traces[1], row=1, col=2)

fig.update_layout(height=500, width=800,
                  title='Attendance Distribution with Target',
                  showlegend = False,
                  font=dict(size=14))

fig.show()


Observation

1. The majority of the students attend daytime classes
2. The dropout share is higher among students who attend nighttime classes

# Evaluating Courses

We review the values in the Course feature

In [None]:
visualization_dataset['Course'].unique()

Let's get the counts for the Course feature

In [None]:
visualization_dataset['Course'].value_counts()

From the dataset we know the following:

*   1 : Biofuel Production Technologies
*   2: Animation and Multimedia Design
*   3: Social Service (evening attendance)
*   4: Agronomy
*   5: Communication Design
*   6: Veterinary Nursing
*   7: Informatics Engineering
*   8: Equiniculture
*   9: Management
*   10: Social Service
*   11: Tourism
*   12: Nursing
*   13: Oral Hygiene
*   14: Advertising and Marketing Management
*   15: Journalism and Communication
*   16: Basic Education
*   17: Management (evening attendance)

For visualization, we convert the numerical values to categorical values

In [None]:
visualization_dataset['Course'] = visualization_dataset['Course'].map({1: 'Biofuel Production Technologies',
 2: 'Animation and Multimedia Design', 3: 'Social Service (evening attendance)',
 4: 'Agronomy', 5: 'Communication Design', 6: 'Veterinary Nursing',
 7: 'Informatics Engineering', 8: 'Equiniculture', 9: 'Management',
 10: 'Social Service', 11: 'Tourism', 12: 'Nursing', 13: 'Oral Hygiene',
 14: 'Advertising and Marketing Management', 15: 'Journalism and Communication',
 16: 'Basic Education', 17: 'Management (evening attendance)'})

Let's use some visuals to review how a student's course relates to the dropout ratio

In [None]:
enable_plotly_in_cell()

fig = px.histogram(visualization_dataset, y='Course', color='Target',
                   width = 900, height = 600,  text_auto=True,
                   color_discrete_sequence=px.colors.qualitative.G10)

fig.update_layout(title='Course Distribution by Target')
fig.show()

Observation

1. Management (evening attendance) has the highest number of dropouts
2. Nursing has a high number of dropouts. However, its overall dropout ratio is lower than that of other courses

# Evaluating Marital Status

We review the values in the Marital status feature

In [None]:
visualization_dataset['Marital status'].unique()

Let's get the counts for the Marital status feature

In [None]:
visualization_dataset['Marital status'].value_counts()

From the dataset we know the following:

*   1: Single
*   2: Married
*   3: Widower
*   4: Divorced
*   5: Facto union
*   6: Legally Separated

For visualization, we convert the numerical values to categorical values

In [None]:
visualization_dataset['Marital status'] = visualization_dataset['Marital status'].map({1:'Single', 2: 'Married',
                                                             3: 'Widower', 4: 'Divorced',
                                                             5: 'Facto union', 6: 'Legally Separated'})

Let's use some visuals to review how a student's marital status relates to the dropout ratio

In [None]:
enable_plotly_in_cell()

fig = px.histogram(visualization_dataset, x='Marital status',
                   opacity = 1, barmode = 'overlay',
                   width = 800, height = 500,
                   color_discrete_sequence=px.colors.qualitative.G10)

fig.update_layout(title='Marital status Distribution')
fig.show()

In [None]:
enable_plotly_in_cell()

marital_counts = visualization_dataset['Marital status'].value_counts()

fig = px.pie(values = marital_counts.values,
            names = marital_counts.index,
            title='Marital status percentage split',
            width = 800, height = 500,
            color_discrete_sequence=["red", "green", "blue", "goldenrod", "magenta",'yellow']
            )

# Labels come from the value_counts index, so each slice is named correctly
fig.update_traces(textposition='inside',showlegend=True, hole=0.4,textinfo='percent', pull=[0,0.2,0.1])
fig.show()

Getting dictionaries for Marital Status

In [None]:
status = visualization_dataset['Marital status'].unique()
status_dictionaries = get_data_dictionaries(status, 'Marital status', 'Target', [])
status_dictionaries

At this point, since we are producing more detailed plots, let's create a function to avoid repeating code. The following function generates the grid of subplot specifications (one entry per subplot) from the input data

In [None]:
def create_sub_plots(chart_to_plot, row, col):
  # Build the `specs` grid for make_subplots: `row` rows of `col` identical cells
  cols = chart_to_plot * col
  rows = [cols] * row
  return rows
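For reference, the structure this helper produces is a list of rows, each row being a list of cell specifications. Because `[cols] * row` repeats the same inner list object, every row of the result is the same list, which is harmless as long as the grid is never modified. The variant below is a sketch of an alternative that builds an independent dictionary for every cell; the name `pie_specs` is an assumption made for illustration.

```python
def pie_specs(n_rows, n_cols):
    """Return an n_rows x n_cols grid of {'type': 'pie'} cell specifications."""
    # A fresh dict per cell, so editing one cell never changes another
    return [[{"type": "pie"} for _ in range(n_cols)] for _ in range(n_rows)]

specs = pie_specs(2, 3)
print(len(specs), "rows x", len(specs[0]), "columns")
```

The result can be passed to `make_subplots(rows=2, cols=3, specs=specs)` in the same way as the output of `create_sub_plots`.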

Creating subplots for Marital status distribution of students based on Target column

In [None]:
enable_plotly_in_cell()

fig = make_subplots(rows=2, cols=3, subplot_titles = list(status), specs= create_sub_plots([{'type':'pie'}], 2,3))

traces = []

create_pie(status_dictionaries, traces, ["salmon", "lightblue", "lightgreen"])

# Place the six pies row by row on the 2 x 3 grid
for position, trace in enumerate(traces):
  fig.add_trace(trace, row=position // 3 + 1, col=position % 3 + 1)

fig.update_layout(height=800, width=1000, title='Marital Status Distribution with Target',
                  showlegend = False,
                  font=dict(size=14))

fig.show()


Observation

1. Majority of the students are single
2. Legally separated and Married students have a high dropout rate

# Analyzing Features with Yes/No selections

From the dataset we know the following:

1. Displaced has values

*   1: Yes
*   0: No

2. Educational special needs

*   1: Yes
*   0: No

3. Debtor has values

*   1: Yes
*   0: No

4. Tuition fees up to date

*   1: Yes
*   0: No

5. Scholarship holder

*   1: Yes
*   0: No

6. International

*   1: Yes
*   0: No

For visualization, we convert the numerical values to categorical values

In [None]:
data_conversion = ['Displaced', 'Educational special needs','Debtor', 'Tuition fees up to date', 'Scholarship holder', 'International']
for i in data_conversion:
    visualization_dataset[i] = visualization_dataset[i].map({1:'Yes', 0: 'No'})

Getting dictionaries for Displaced feature

In [None]:
displaced = visualization_dataset['Displaced'].unique()
displaced_dictionaries = get_data_dictionaries(displaced, 'Displaced', 'Target', [])
displaced_dictionaries

Getting dictionaries for Educational special needs feature

In [None]:
educational_special_needs = visualization_dataset['Educational special needs'].unique()
educational_special_needs_dictionaries = get_data_dictionaries(educational_special_needs, 'Educational special needs', 'Target', [])
educational_special_needs_dictionaries

Getting dictionaries for Debtor feature

In [None]:
debtor = visualization_dataset['Debtor'].unique()
debtor_dictionaries = get_data_dictionaries(debtor, 'Debtor', 'Target', [])
debtor_dictionaries

Getting dictionaries for Tuition fees up to date feature

In [None]:
tuition_fees_up_to_date = visualization_dataset['Tuition fees up to date'].unique()
tuition_fees_up_to_date_dictionaries = get_data_dictionaries(tuition_fees_up_to_date, 'Tuition fees up to date', 'Target', [])
tuition_fees_up_to_date_dictionaries

Getting dictionaries for Scholarship holder feature

In [None]:
scholarship_holder = visualization_dataset['Scholarship holder'].unique()
scholarship_holder_dictionaries = get_data_dictionaries(scholarship_holder, 'Scholarship holder', 'Target', [])
scholarship_holder_dictionaries

Getting dictionaries for International feature

In [None]:
international = visualization_dataset['International'].unique()
international_dictionaries = get_data_dictionaries(international, 'International', 'Target', [])
international_dictionaries

Creating subplots for Displaced distribution of students based on Target column

In [None]:
enable_plotly_in_cell()

# Titles are built from `displaced` so they always match the order of the pies
fig = make_subplots(rows=1, cols=2, subplot_titles=[f'Displaced: {value}' for value in displaced],
                    specs=[[{'type': 'pie'},{'type': 'pie'}]])

traces = []
create_pie(displaced_dictionaries, traces, ['pink', 'violet', 'beige'])

fig.add_trace(traces[0], row=1, col=1)
fig.add_trace(traces[1], row=1, col=2)

fig.update_layout(height=500, width=800,
                  title='Displaced Distribution with Target',
                  showlegend = False,
                  font=dict(size=14))

fig.show()

Creating subplots for Debtor distribution of students based on Target column

In [None]:
enable_plotly_in_cell()

# Titles are built from `debtor` so they always match the order of the pies
fig = make_subplots(rows=1, cols=2, subplot_titles=[f'Debtor: {value}' for value in debtor],
                    specs=[[{'type': 'pie'},{'type': 'pie'}]])

traces = []
create_pie(debtor_dictionaries, traces, ['pink', 'violet', 'beige'])

fig.add_trace(traces[0], row=1, col=1)
fig.add_trace(traces[1], row=1, col=2)

fig.update_layout(height=500, width=800,
                  title='Debtor Distribution with Target',
                  showlegend = False,
                  font=dict(size=14))

fig.show()

Creating subplots for Educational special needs distribution of students based on Target column

In [None]:
enable_plotly_in_cell()

# Titles are built from `educational_special_needs` so they always match the order of the pies
fig = make_subplots(rows=1, cols=2, subplot_titles=[f'Educational special needs: {value}' for value in educational_special_needs],
                    specs=[[{'type': 'pie'},{'type': 'pie'}]])

traces = []
create_pie(educational_special_needs_dictionaries, traces, ['pink', 'violet', 'beige'])

fig.add_trace(traces[0], row=1, col=1)
fig.add_trace(traces[1], row=1, col=2)

fig.update_layout(height=500, width=800,
                  title='Educational special needs Distribution with Target',
                  showlegend = False,
                  font=dict(size=14))

fig.show()

Creating subplots for Tuition fees up to date distribution of students based on Target column

In [None]:
enable_plotly_in_cell()

# Titles are built from `tuition_fees_up_to_date` so they always match the order of the pies
fig = make_subplots(rows=1, cols=2, subplot_titles=[f'Tuition fees up to date: {value}' for value in tuition_fees_up_to_date],
                    specs=[[{'type': 'pie'},{'type': 'pie'}]])

traces = []
create_pie(tuition_fees_up_to_date_dictionaries, traces, ['pink', 'violet', 'beige'])

fig.add_trace(traces[0], row=1, col=1)
fig.add_trace(traces[1], row=1, col=2)

fig.update_layout(height=500, width=800,
                  title='Tuition fees up to date Distribution with Target',
                  showlegend = False,
                  font=dict(size=14))

fig.show()

Creating subplots for Scholarship holder distribution of students based on Target column

In [None]:
enable_plotly_in_cell()

# Titles are built from `scholarship_holder` so they always match the order of the pies
fig = make_subplots(rows=1, cols=2, subplot_titles=[f'Scholarship holder: {value}' for value in scholarship_holder],
                    specs=[[{'type': 'pie'},{'type': 'pie'}]])

traces = []
# 'lightindigo' and 'lightred' are not valid CSS color names, so valid ones are used instead
create_pie(scholarship_holder_dictionaries, traces, ['indigo', 'lightpink', 'lightcoral'])

fig.add_trace(traces[0], row=1, col=1)
fig.add_trace(traces[1], row=1, col=2)

fig.update_layout(height=500, width=800,
                  title='Scholarship Distribution with Target',
                  showlegend = False,
                  font=dict(size=14))

fig.show()

Creating subplots for International distribution of students based on Target column

In [None]:
enable_plotly_in_cell()

# Titles are built from `international` so they always match the order of the pies
fig = make_subplots(rows=1, cols=2, subplot_titles=[f'International: {value}' for value in international],
                    specs=[[{'type': 'pie'},{'type': 'pie'}]])

traces = []
# 'lightindigo' and 'lightred' are not valid CSS color names, so valid ones are used instead
create_pie(international_dictionaries, traces, ['indigo', 'lightpink', 'lightcoral'])

fig.add_trace(traces[0], row=1, col=1)
fig.add_trace(traces[1], row=1, col=2)

fig.update_layout(height=500, width=800,
                  title='International Distribution with Target',
                  showlegend = False,
                  font=dict(size=14))

fig.show()

Observation

1. Students who are not displaced have a higher dropout ratio
2. Students who have debt have a much higher dropout ratio
3. The dropout ratio differs little between students who need special care and those who don't
4. Students who don't have their fees paid show a higher dropout ratio
5. Students who have a scholarship show a much lower dropout ratio than those who don't have a scholarship
6. International students have a lower dropout ratio than non-international students. However, the difference is not large
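
The ratios behind these observations can also be computed directly rather than read off the pie charts. The following is a minimal sketch, assuming the data is available as (group, target) pairs; the helper name `dropout_ratio_by_group` and the sample rows are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

def dropout_ratio_by_group(records):
    """Return {group: share of records in that group whose target is 'Dropout'}."""
    totals = Counter(group for group, _ in records)
    dropouts = Counter(group for group, target in records if target == 'Dropout')
    return {group: dropouts[group] / totals[group] for group in totals}

# Hypothetical (scholarship holder, target) rows, for illustration only
sample = [
    (0, 'Dropout'), (0, 'Dropout'), (0, 'Graduate'), (0, 'Enrolled'),
    (1, 'Graduate'), (1, 'Graduate'), (1, 'Dropout'), (1, 'Graduate'),
]
ratios = dropout_ratio_by_group(sample)
print(ratios)
```

On the real data, a pandas group-by on the feature column followed by `value_counts(normalize=True)` on `Target` should give the same kind of table.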

# Analyzing features related to Economic factors


From the dataset we know the following features are related to economic factors:

1. Inflation rate
2. Unemployment rate
3. GDP






Let's review, with the help of some visuals, how these economic factors impact the dropout ratio

In [None]:
enable_plotly_in_cell()

fig = px.histogram(visualization_dataset, x='Inflation rate',
                   opacity = 1, barmode = 'group', color='Target',
                   width = 800, height = 500,
                   color_discrete_sequence=px.colors.qualitative.G10)

fig.update_layout(title='Inflation rate % Distribution')
fig.show()

In [None]:
enable_plotly_in_cell()

fig = px.histogram(visualization_dataset, x='Unemployment rate',
                   opacity = 1, barmode = 'group', color='Target',
                   width = 800, height = 500,
                   color_discrete_sequence=px.colors.qualitative.G10)

fig.update_layout(title='Unemployment rate % Distribution')
fig.show()

In [None]:
enable_plotly_in_cell()

fig = px.histogram(visualization_dataset, x='GDP',
                   opacity = 1, barmode = 'group', color='Target',
                   width = 800, height = 500,
                   color_discrete_sequence=px.colors.qualitative.G10)

fig.update_layout(title='GDP % Distribution')
fig.show()

Observation

1. Based on these histograms, we can't find a clear pattern between the economic factors and the dropout ratio

# Feature Selection

In order to choose the features, we first need to check the correlation between the features and the target to identify their impact

Based on our initial review, we know that only the Target column is non-numeric

Since Target is the output column, we need to convert it to numeric form before we can find its correlation with the other columns

In [None]:
dataset['Target'] = dataset['Target'].map({
    'Dropout':0,
    'Graduate':1,
    'Enrolled':2
})

In [None]:
dataset['Target']

In [None]:
dataset.dtypes

Now all our features are of numeric form

Let's find the correlation of Target with all other numeric columns

In [None]:
dataset.corr()['Target']
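
`DataFrame.corr()` uses the Pearson coefficient by default. The sketch below spells that formula out on two short lists so the numbers above are easier to interpret. The values are made up, and the function name `pearson` is only illustrative. Note that the coefficient treats the 0/1/2 coding of Target as an ordered numeric scale, which is a property of the encoding rather than of the labels.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical binary feature and binary target, for illustration only
feature = [1, 1, 0, 1, 0, 0]
target = [1, 1, 0, 1, 0, 1]
r = pearson(feature, target)
print(round(r, 3))
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship.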

Let's visualize the correlation between all numeric columns

In [None]:
enable_plotly_in_cell()
# Plot the correlation matrix rather than the raw rows of the dataset
fig = px.imshow(dataset.corr())
fig.show()

In [None]:
plt.figure(figsize=(30, 30))
sns.heatmap(dataset.corr() , annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

Observation

The following features can be removed due to their low correlation with the target variable.

1. Nationality: its correlation is very close to zero (-0.0047), so it may not have a significant impact on the target variable.

2. Mother's qualification: with a correlation of -0.038, it appears to have a weak relationship with the target variable.

3. Father's qualification: with a correlation of 0.00033, it seems to have little influence on the target variable.

4. Educational special needs: with a low correlation of -0.0074, it may not strongly affect the target variable.

5. International: with a low correlation of 0.0039, it should have minimal impact on the target variable.

6. Curricular units 1st sem (without evaluations): it has a correlation of -0.069, which is relatively low compared to the other columns related to curricular units.

7. Unemployment rate: its correlation of 0.0086 indicates a weak relationship with the target variable.

8. Inflation rate: with a low correlation of -0.027, it has a relatively low impact on the target variable.

These columns have low absolute correlation values, may not provide significant predictive power for the Target variable, and can be dropped

In [None]:
new_dataset = dataset.copy()
new_dataset = new_dataset.drop(columns=['Nationality',
                                  'Mother\'s qualification',
                                  'Father\'s qualification',
                                  'Educational special needs',
                                  'International',
                                  'Curricular units 1st sem (without evaluations)',
                                  'Unemployment rate',
                                  'Inflation rate'], axis=1)
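
The manual selection above amounts to a cut-off on the absolute correlation. The sketch below applies such a cut-off to the coefficients quoted in the observation. The 0.07 threshold and the helper name `weakly_correlated` are assumptions made for illustration; the notebook does not state a threshold.

```python
# Correlations with Target as quoted in the observation above
quoted_corr = {
    'Nationality': -0.0047,
    "Mother's qualification": -0.038,
    "Father's qualification": 0.00033,
    'Educational special needs': -0.0074,
    'International': 0.0039,
    'Curricular units 1st sem (without evaluations)': -0.069,
    'Unemployment rate': 0.0086,
    'Inflation rate': -0.027,
}

THRESHOLD = 0.07  # assumed cut-off, not a value stated in the notebook

def weakly_correlated(corr, threshold):
    """Feature names with |correlation| below threshold, weakest first."""
    weak = [name for name, value in corr.items() if abs(value) < threshold]
    return sorted(weak, key=lambda name: abs(corr[name]))

to_drop = weakly_correlated(quoted_corr, THRESHOLD)
for name in to_drop:
    print(f'{name}: {quoted_corr[name]}')
```

Pearson correlation only measures linear association, so a low value is a hint rather than proof that a feature carries no information.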

In [None]:
new_dataset.info()

#Top 10 Features with Highest Correlation to Target

In [None]:
enable_plotly_in_cell()
# Exclude Target itself, whose correlation with itself is always 1
correlations = new_dataset.corr()['Target'].drop('Target')
top_10_features = correlations.abs().nlargest(10).index
top_10_corr_values = correlations[top_10_features]

plt.figure(figsize=(10, 11))
plt.bar(top_10_features, top_10_corr_values)
plt.xlabel('Features')
plt.ylabel('Correlation with Target')
plt.title('Top 10 Features with Highest Correlation to Target')
plt.xticks(rotation=45)
plt.show()

We can further remove the following columns due to their high correlations

1. Mother's occupation
2. Father's occupation
3. Curricular units 1st sem (credited)           
4. Curricular units 1st sem (evaluations)   
5. Curricular units 2nd sem (credited)           
6. Curricular units 2nd sem (evaluations)  
7. Curricular units 2nd sem (without evaluations)
8. GDP

In [None]:
final_dataset = new_dataset.copy()
final_dataset = final_dataset.drop(columns=['Mother\'s occupation',
                                  'Father\'s occupation',
                                  'Curricular units 1st sem (credited)',
                                  'Curricular units 1st sem (evaluations)',
                                  'Curricular units 2nd sem (credited)',
                                  'Curricular units 2nd sem (evaluations)',
                                  'Curricular units 2nd sem (without evaluations)',
                                  'GDP'], axis=1)

In [None]:
final_dataset.info()

This is the final dataset we have after the feature selection process.

Let's review the values across all features in our final dataset.

We will also need this information while building the web application

In [None]:
for i in final_dataset.columns:
  print(i,set(final_dataset[i].values))

# Data Processing

Since we only need to predict whether a student graduates or drops out, we do not need the Enrolled class.

Hence, we can filter out the rows where Target is Enrolled (encoded as 2).

In [None]:
final_dataset=final_dataset[final_dataset.Target!=2]
final_dataset

This is our updated final dataset

# Prepare the data

Using the final dataset we got after the analysis, we will prepare the data.

In [None]:
X = final_dataset.drop('Target', axis=1)
y = final_dataset['Target']
final_dataset.head()

Splitting the data into Training & Testing Data

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2, random_state=40)
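
As a rough illustration of what `test_size=0.2` and `random_state=40` control, the sketch below shuffles row indices with a fixed seed and holds out 20% of them. It is a plain-Python approximation with a hypothetical helper `shuffled_split`; it does not reproduce the exact rows that scikit-learn selects.

```python
import random

def shuffled_split(n_rows, test_size=0.2, seed=40):
    """Shuffle row indices with a fixed seed and hold out the first share as the test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffled_split(100)
print(len(train_idx), len(test_idx))

# The same seed gives the same split, which keeps results reproducible
assert shuffled_split(100) == (train_idx, test_idx)
```

`train_test_split` additionally accepts `stratify=y` to keep the Dropout/Graduate proportions equal in both parts, which the call above does not use.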

# Modelling

Since this is a classification problem where we need to predict a target value of either 0 (Dropout) or 1 (Graduate), we will be using the following techniques: Decision Tree, Random Forest, K-Nearest Neighbours, Support Vector Machines, and Logistic Regression.

#Evaluating with Decision Tree

Train a Decision Tree Model

In [None]:
dtree = DecisionTreeClassifier(random_state=0)
dtree.fit(X_train,y_train)

Predict target values for Test data

In [None]:
y_pred_dtree = dtree.predict(X_test)

In [None]:
dtree_confusion_matrix = confusion_matrix(y_test, y_pred_dtree)

Evaluate the model's accuracy

In [None]:
dtree_accuracy = round(accuracy_score(y_test, y_pred_dtree), 3)
print(f'Accuracy of Decision tree model is {dtree_accuracy * 100}%')

Print and review the classification report for the model

In [None]:
print(classification_report(y_test,y_pred_dtree))
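
The figures in the classification report can be derived from the confusion matrix. In scikit-learn's convention, rows are actual classes and columns are predicted classes, ordered by label, so row 0 is Dropout and row 1 is Graduate here. The sketch below shows the arithmetic; the counts are made up and the helper `binary_metrics` is illustrative only.

```python
def binary_metrics(matrix, positive=0):
    """Accuracy, precision, recall and F1 for one class of a 2x2 confusion matrix.

    matrix[i][j] is the number of samples whose actual class is i and predicted class is j.
    """
    negative = 1 - positive
    tp = matrix[positive][positive]
    fn = matrix[positive][negative]
    fp = matrix[negative][positive]
    tn = matrix[negative][negative]
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Hypothetical counts, for illustration only
example = [[80, 20],
           [10, 90]]
metrics = binary_metrics(example, positive=0)
for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```

For a dropout predictor, recall on the Dropout class is often the figure of most interest, since it measures how many actual dropouts the model catches.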

#Evaluate with Random Forest

Train a Random Forest Model

In [None]:
rf = RandomForestClassifier(random_state=2)
rf.fit(X_train,y_train)

Predict target values for Test data

In [None]:
y_pred_rf = rf.predict(X_test)

In [None]:
rf_confusion_matrix = confusion_matrix(y_test, y_pred_rf)

Evaluate the model's accuracy

In [None]:
rf_accuracy = round(accuracy_score(y_test, y_pred_rf), 3)
print(f'Accuracy of Random Forest model is {rf_accuracy * 100}%')

Print and review the classification report for the model

In [None]:
print(classification_report(y_test,y_pred_rf))

#Evaluate with K-Nearest Neighbours

In [None]:
knn = KNeighborsClassifier()

Use elbow method to find optimal K value

In [None]:
accuracy = []
for i in range(1,21):
  knn_model = KNeighborsClassifier(n_neighbors=i)
  knn_model.fit(X_train, y_train)
  y_pred_knn = knn_model.predict(X_test)
  accuracy.append(accuracy_score(y_test, y_pred_knn))

In [None]:
plt.figure(figsize=(10,6))
plt.plot(range(1,21), accuracy, color='blue', linestyle='dashed', marker='o', markerfacecolor='red', markersize=10)
plt.title('Accuracy vs. K Value')
plt.xlabel('K')
plt.ylabel('Accuracy Score')

In [None]:
k = accuracy.index(max(accuracy)) + 1

In [None]:
Markdown(f"""
#### From the result we can determine that the optimal k value, with the highest accuracy score, is {k}""")
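
The search above scores each k on the test set, so the same rows are used both to choose k and to report accuracy, which tends to make the reported accuracy optimistic. A common alternative is to choose k by cross-validation on the training data only; scikit-learn offers `cross_val_score` and `GridSearchCV` for this. The sketch below shows the idea in plain Python with a tiny hand-written k-NN. The data points and the helper names are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, point, k):
    """Majority label among the k training points closest to `point`."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], point))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def cv_accuracy(X, y, k, n_folds=4):
    """Mean k-NN accuracy over n_folds folds; fold f holds out indices f, f + n_folds, ..."""
    scores = []
    for fold in range(n_folds):
        held_out = [i for i in range(len(X)) if i % n_folds == fold]
        kept = [i for i in range(len(X)) if i % n_folds != fold]
        train_X = [X[i] for i in kept]
        train_y = [y[i] for i in kept]
        hits = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / n_folds

# Hypothetical training data: two well separated clusters
X_demo = [(0, 0), (5, 5), (0, 1), (5, 6), (1, 0), (6, 5), (1, 1), (6, 6)]
y_demo = [0, 1, 0, 1, 0, 1, 0, 1]

cv_scores = {k: cv_accuracy(X_demo, y_demo, k) for k in (1, 3, 5)}
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_k)
```

With this approach the test set is touched only once, after k has been fixed.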

Train a K-Nearest Neighbours Model

In [None]:
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)

Predict target values for Test data

In [None]:
y_pred_knn = knn.predict(X_test)

In [None]:
knn_confusion_matrix = confusion_matrix(y_test, y_pred_knn)

In [None]:
knn_accuracy = round(accuracy_score(y_test, y_pred_knn), 3)
print(f'Accuracy of KNN Model is {knn_accuracy * 100}%')

Print and review the classification report for the model

In [None]:
print(classification_report(y_test,y_pred_knn))

#Evaluate with Support Vector Machines

Iterate to see which kernel gives the best result

In [None]:
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
score_list = {}

# Use a dedicated loop variable so the k chosen for KNN is not overwritten
for kernel in kernels:
    svm = SVC(kernel=kernel)
    svm.fit(X_train, y_train)
    # SVC.score returns the mean accuracy on the given data
    score_list[kernel] = svm.score(X_test, y_test)

score_list

Train a Support Vector Machine model

In [None]:
svm = SVC(kernel = 'linear')
svm.fit(X_train, y_train)

Predict target values for Test data

In [None]:
y_pred_svm = svm.predict(X_test)

In [None]:
svm_confusion_matrix = confusion_matrix(y_test, y_pred_svm)

Evaluate the model's accuracy

In [None]:
svm_accuracy = round(accuracy_score(y_test, y_pred_svm), 3)
print(f'Accuracy of Support Vector Machine model is {svm_accuracy * 100}%')

Print and review the classification report for the model

In [None]:
print(classification_report(y_test,y_pred_svm))

#Evaluate with Logistic Regression

Train a Logistic Regression Model

In [None]:
lr = LogisticRegression(solver='lbfgs', max_iter=1000)
lr.fit(X_train,y_train)
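
A fitted binary logistic regression turns a weighted sum of the features into a probability with the sigmoid function and predicts class 1 (Graduate) when that probability is above 0.5. The sketch below shows this step. The weights and bias are made up and are not the coefficients learned by `lr`.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of class 1 for one row under a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for two features, for illustration only
weights = [1.5, -0.8]
bias = -0.2

row = [1, 0]
p_graduate = predict_proba(weights, bias, row)
label = 1 if p_graduate > 0.5 else 0
print(round(p_graduate, 3), label)
```

The learned values can be inspected on the fitted estimator through its `coef_` and `intercept_` attributes.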

Predict target values for Test data

In [None]:
y_pred_lr = lr.predict(X_test)

In [None]:
lr_confusion_matrix = confusion_matrix(y_test, y_pred_lr)

Evaluate the model's accuracy

In [None]:
lr_accuracy = round(accuracy_score(y_test, y_pred_lr), 3)
print(f'Accuracy of Logistic Regression model is {lr_accuracy * 100}%')

Print and review the classification report for the model

In [None]:
print(classification_report(y_test,y_pred_lr))

#Reviewing Confusion Matrix based on Each Model

Plot a bar graph of the accuracy of each model

In [None]:
enable_plotly_in_cell()
accuracy_of_models = {'SVC': svm_accuracy,
                      'Random Forest': rf_accuracy,
                      'Logistic Regression': lr_accuracy,
                      'KNN': knn_accuracy,
                      'Decision Tree': dtree_accuracy}


fig = px.bar(x = list(accuracy_of_models.keys()), y= list(accuracy_of_models.values()),
             color = list(accuracy_of_models.values()),
             width = 800, height = 400,
             color_discrete_sequence=px.colors.qualitative.G10,
             labels={'x':'Classifier', 'y':'Accuracy'}, text_auto=True)


fig.update_layout(title='Accuracy performance of classification models')
fig.show()

Compare confusion matrix for each model

In [None]:
confusion_matrix_of_models = {'SVC': svm_confusion_matrix,
                      'Random Forest': rf_confusion_matrix,
                      'Logistic Regression': lr_confusion_matrix,
                      'KNN': knn_confusion_matrix,
                      'Decision Tree':dtree_confusion_matrix}

Let's find the best model by comparing each of the models we have evaluated against each other

In [None]:
enable_plotly_in_cell()
scores = [svm_accuracy, rf_accuracy, lr_accuracy, knn_accuracy, dtree_accuracy]
best_score = max(scores)
best_model = ''

for key, value in accuracy_of_models.items():
  if best_score == value:
    best_model = key
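    # Rows of a scikit-learn confusion matrix are actual classes and columns are
    # predicted classes, ordered by label: 0 (Dropout), then 1 (Graduate)
    fig = px.imshow(confusion_matrix_of_models[key], text_auto=True, aspect="auto",
                    color_continuous_scale='viridis',
                    x=['Dropout', 'Graduate'],
                    y=['Dropout', 'Graduate'],
                    labels=dict(x="Predicted", y="Actual"))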

    fig.update_layout(title = f'{key} Matrix', height=500, width=800)
    fig.show()

    confusion_matrix_of_models.pop(key)

We get Logistic Regression and Random Forest as the best models

Let's review the confusion matrices for the other models

In [None]:
enable_plotly_in_cell()
colors = ['plotly3', 'RdBu_r', 'thermal','rainbow']
position = 0


fig = make_subplots(rows=1, cols=4, subplot_titles=list(confusion_matrix_of_models.keys()))

for key, value in confusion_matrix_of_models.items():
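    # z[i][j] counts samples with actual class i (y axis) and predicted class j (x axis);
    # label 0 is Dropout and label 1 is Graduate
    heatmap = go.Heatmap(z=confusion_matrix_of_models[key], text=confusion_matrix_of_models[key],
                         colorscale=colors[position],
                         x=['Dropout', 'Graduate'],
                         y=['Dropout', 'Graduate'],
                         texttemplate="%{text}",
                         showscale=False)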

    fig.add_trace(heatmap, row=1, col=position+1)

    position += 1


fig.update_layout(height=500, width=1700, title_text="Other models" )

fig.show()

After reviewing the confusion matrices, we can conclude that Logistic Regression and Random Forest are the best models.

We can choose between these two based on the analysis we have already done.

In [None]:
Markdown(f"""
#### From the results above we can see that {best_model} performs best with the highest accuracy of {round(best_score * 100, 2)}%""")
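
The selection loop earlier in this section overwrites `best_model` each time a model matches the best score, so when two models tie only the last one is named in the message. The sketch below returns every tied model instead. The accuracies are made up and the helper `best_models` is illustrative only.

```python
def best_models(accuracies):
    """Return (best score, names of all models that reach it)."""
    best = max(accuracies.values())
    winners = [name for name, score in accuracies.items() if score == best]
    return best, winners

# Hypothetical rounded accuracies, for illustration only
demo_accuracies = {
    'SVC': 0.90,
    'Random Forest': 0.91,
    'Logistic Regression': 0.91,
    'KNN': 0.85,
    'Decision Tree': 0.86,
}
score, winners = best_models(demo_accuracies)
print(f"{' and '.join(winners)}: {score * 100:.1f}%")
```

Because the accuracies are rounded to three decimals before the comparison, ties are more likely than they would be with the raw scores.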

# Preparing Data for Web Application

Get feature columns used for training the models

This is needed for the web application. We will be using this to build our application

In [None]:
selected_feat= X_train.columns
print("The features used for the model: ")
for feature in selected_feat:
  print(feature)
print("The number of features : {}".format(len(selected_feat)))


#Save Model

Save selected features

We save it as part of the model to make it easier to consume on the web application. This way we don't need to generate another file and store it

In [None]:
lr.feature_names = list(selected_feat)

Save model

We save the model. We will use this saved model to build our web application

In [None]:
joblib.dump(lr, 'dropout_prediction_model.pkl')
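
Attaching `feature_names` as an attribute works because the whole object is serialized. An alternative is to store the model and its metadata together in one dictionary, which leaves the estimator itself untouched. The sketch below uses the standard-library `pickle` module and a plain dictionary as a stand-in for the fitted model so that it runs without scikit-learn; `joblib.dump` and `joblib.load` can be used in the same way. The file name and the feature subset are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted estimator so the sketch runs without scikit-learn
stand_in_model = {'weights': [1.5, -0.8], 'bias': -0.2}

bundle = {
    'model': stand_in_model,
    'feature_names': ['Gender', 'Age at enrollment'],  # illustrative subset
}

path = os.path.join(tempfile.mkdtemp(), 'model_bundle.pkl')  # hypothetical file name
with open(path, 'wb') as handle:
    pickle.dump(bundle, handle)

with open(path, 'rb') as handle:
    restored = pickle.load(handle)

print(restored['feature_names'])
```

Pickle and joblib files can execute code when loaded, so the web application should only load model files from a trusted source.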

#Test model

Let's quickly test our model to see that it predicts in line with our analysis

In [None]:
test_model = joblib.load('dropout_prediction_model.pkl')

We can see we get similar results to what we got when evaluating the model

In [None]:
result = test_model.score(X_test, y_test)
print(result)
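
In the web application, input values have to reach the model in the same column order that was used for training. The sketch below shows how the stored list of feature names could be used for that. The helper `build_row`, the feature subset, and the form values are hypothetical.

```python
def build_row(form_values, feature_names):
    """Arrange form values in training column order; raise if any feature is missing."""
    missing = [name for name in feature_names if name not in form_values]
    if missing:
        raise ValueError(f'Missing features: {missing}')
    return [form_values[name] for name in feature_names]

feature_names = ['Gender', 'Age at enrollment', 'Scholarship holder']  # illustrative subset
form_values = {'Scholarship holder': 1, 'Gender': 0, 'Age at enrollment': 19}

row = build_row(form_values, feature_names)
print(row)
```

The resulting list can be wrapped in another list, as in `[row]`, to form the single-row input that `predict` expects.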

In [None]:
final_dataset.sample(10)

# Conclusion

Hence, we can conclude that Logistic Regression is the best model to use for our dataset. The Random Forest model was a close second. In the end, we decided to go with Logistic Regression due to its higher accuracy.


In conclusion, our machine learning project aimed at predicting student dropout has yielded significant insights and outcomes. The issue of student dropout is a pressing concern for educational institutions and society as a whole. Completing this project has allowed us to apply the concepts and techniques we've learned in our machine learning course to tackle a real-world problem with far-reaching implications.

Throughout our analysis, we leveraged various machine learning algorithms, explored relevant datasets, and employed feature engineering to build a predictive model. We also identified and considered a range of factors that contribute to student dropout, including academic performance, socio-economic background, and other demographic variables.

While our project has made significant progress in predicting student dropout, it's essential to acknowledge that this issue is complex and multifaceted. Future work can focus on refining the model, incorporating more diverse and real-time data sources, and enhancing the accuracy of predictions.

The ability to predict student dropout opens doors to improved educational outcomes, greater equality, and a brighter future for students everywhere.


# Bibliography

1. Mehreen Saeed, Modeling Pipeline Optimization With scikit-learn. URL - https://machinelearningmastery.com/modeling-pipeline-optimization-with-scikit-learn/

2. Pratik Parmar, Enable Plotly in a cell in Colab. URL - https://stackoverflow.com/a/54771665

3. Build a function to get dictionaries. URL - https://stackoverflow.com/questions/8653516/search-a-list-of-dictionaries-in-python

4. Gilbert Tanner, Building web app with Streamlit and deploying with Heroku. URL - https://gilberttanner.com/blog/deploying-your-streamlit-dashboard-with-heroku/

5. M.A. Al-Barrak, Muna S. Al-Razgan, Predicting students’ performance through classification: A case study, Journal of Theoretical and Applied Information Technology 75(2):167-175. URL - https://www.researchgate.net/publication/282381796_Predicting_students'_performance_through_classification_A_case_study


# Note from the Author


This file was generated using NBConvert. Additional information on how to
prepare articles for submission is [here](https://github.com/ad17171717/YouTube-Tutorials/blob/main/Google%20Colab%20Tutorials/Exporting%20to%20a%20PDF%20Format/Google_Colab_Exporting_to_a_PDF_Format!.ipynb).

The article itself is an executable Colab notebook that can be downloaded from [GitHub](https://github.com/kunwarRaj20/classification) with all the necessary artifacts.

Link to the web application - [Student Dropout Predictor](https://py-test-1-964598a9c406.herokuapp.com/)

1. Kunwar Rajdeep Singh - York University School of Continuing Studies