# PROGRAMMING FOR DATA ANALYSIS PROJECT



### OBJECTIVES

1. Source a dataset which has at least 100 data points across 4 variables.

2. Investigate the variables, their likely distributions and their relationships
    with each other.

3. Simulate a data set as closely matching the properties of the real world
    data set as possible.


#### Sources

P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. 
http://www3.dsi.uminho.pt/pcortez/student.pdf
 These two sources were very useful.  I found my dataset in the first source and used the second to find out how the Cortez & Silva
study related to predicting success in education by examining variables in the students' lives.

http://uis.unesco.org/country/PT
    Provides data of population of Portugal broken down by age.
    
https://www.oecd-ilibrary.org/docserver/9789264117020-4-en.pdf
    Provided me with information about the Portuguese school system.
   
https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/
    This source helped me to tidy up the dataset into the columns and rows I wanted to keep.

https://thispointer.com/python-pandas-how-to-display-full-dataframe-i-e-print-all-rows-columns-without-truncation/
    This source helped me to find a way to display all the data in the dataset.
    
https://www.earthdatascience.org/courses/intro-to-earth-data-science/scientific-data-structures-python/pandas-dataframes/indexing-filtering-data-pandas-dataframes/
    This source was very useful with appropriate functions to interpret the data.

https://seaborn.pydata.org/
    This source helped me to use correct syntax to plot my data.  It was also excellent to help me choose which plots to use.
    
https://www.bing.com/videos/search?q=normal+distribution+python
    A useful video to help me code a normal distribution plot.
    
https://stackoverflow.com/questions/16312006/python-numpy-random-normal-only-positive-values
    All my values are positive, so a normal distribution fitted to my data would also produce negative values.  This site
    suggested I try a binomial distribution.  A binomial distribution is a discrete distribution
    that counts successes in a fixed number of yes/no trials, so I decided it would not suit my data either.
    
https://python-graph-gallery.com/25-histogram-with-several-variables-seaborn/
https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0
    These sites helped me with data snippets I could manipulate to create plots to display 
    my dataset.
    
https://stackoverflow.com/questions/10138085/python-plot-normal-distribution 
http://www.johndcook.com/distributions_scipy.html 
http://docs.scipy.org/doc/scipy/reference/stats.html 
http://telliott99.blogspot.com/2010/02/plotting-normal-distribution-with.html
The Stack Overflow site offered some code to help me plot a normal distribution using the mean and standard deviation of my data.
The other three URLs are the sources of the code which I modified to plot my data. I had to abandon this method of illustration as it gave me negative values.

https://www.youtube.com/watch?v=zQy0lEfXsVI 
Noureddin Sadawi runs this YouTube site on Pandas DataFrames; it helped me to construct my histograms.

https://seaborn.pydata.org/tutorial/relational.html 
I used this site to plan my other plots to correlate data.
        





### FINDING A DATASET - Objective 1

Cortez and Silva studied students in two subjects, Portuguese and Maths.  I selected the csv for Maths because I thought
it would be more relatable for me to choose a universal school subject.
I predict there is a link between the variables: school attendance, mother's education, father's education, plans to attend
higher education and achievement.  I will use the dataset produced by Cortez and Silva which includes these variables among others.
First I will import the dataset and eliminate the other variables until I have the ones I want to investigate.

In [None]:
# Import library and dataset.
import pandas as pd

# The dataset was in a zipped file so I had to unzip it, save it to my local machine and then read the filename into
# Jupyter.  The file was not saved using commas as delimiters so I added the parameter delimiter = ; so Pandas could
# organise the information into a separated file of rows and columns.  The folder is saved as student with the other files
# relating to this project.  This original folder has four files from which I have chosen the file to do with 
# achievement in maths.

dfgrade = pd.read_csv("student/student-mat.csv", delimiter=";")

# Check the types of data; object, int etc.
dfgrade.dtypes

# Show the full dataframe.  I have commented this out as it is very big.
#print(dfgrade)


In [None]:
import pandas as pd
dfgrade = pd.read_csv("student/student-mat.csv", delimiter=";")


# Use the drop function to take all the columns out which I will not be using.
dfgrade = dfgrade.drop(["school", "sex", "age", "address","famsize", "Pstatus", "Mjob", "Fjob", "reason", "guardian", "traveltime", 
             "studytime", "failures", "schoolsup", "famsup", "paid", "activities", "nursery", "internet", 
              "romantic", "famrel", "freetime", "goout", "Dalc", "Walc", "health"], axis=1)
#print(dfgrade)

# Having looked at all the data there is a large number of students who got
# 0 as their final grade (G3).  I am going to take this group out of the cohort
# because they are skewing the statistics.  I assume for one reason or another they
# did not take the final exam.  I found this out further down and came back to this cell.

# Drop the zero-grade rows by index label (each label is listed once).
dfgrade = dfgrade.drop([128, 130, 131, 134, 135, 136, 137, 140, 144, 146, 148,
           150, 153, 160, 162, 168, 170, 173, 221, 239, 242, 244,
       259, 264, 269, 296, 310, 316, 332, 333, 334, 337, 341, 367, 387, 389])

# Show the new dataframe, it now has 359 rows and 7 columns, excluding the index column.
print(dfgrade)
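The index list above could also be expressed as a boolean filter on the G3 column. The following is a minimal sketch on a small made-up frame (the values in `toy` are invented for illustration and are not taken from the dataset); only the column name `G3` is shared with the real data.

```python
import pandas as pd

# Toy stand-in for dfgrade; these values are invented for illustration.
toy = pd.DataFrame({"absences": [4, 0, 10, 0, 2],
                    "G3": [10, 0, 15, 0, 8]})

# Keep only the rows whose final grade is not zero.
filtered = toy[toy["G3"] != 0]

print(filtered)
print(len(filtered), "rows remain")
```

Applied to dfgrade, a filter like this would remove every zero-grade row in one step, so the resulting row count should be checked against the 359 rows reported above.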





### INVESTIGATION OF DATA SET - Objective 2

In [None]:
# Information about the dataframe, the variables and their datatype.
dfgrade.info()


In [None]:
# Statistical information about the variables showing the mean, standard deviation,
# quartile values, maximum and minimum values.
dfgrade.describe()


In [None]:
# To examine one data point use the iloc method to locate the piece of data
# in the first position on the first row.
dfgrade.iloc[0:1,0:1]


In [None]:
# To examine the data of one student at index 353.
# This student has 2 parents who left school on or before age 9, he/she hopes
# to go to higher education, he/she has had 4 absences in the year and he/she
# has scored 8 out of 20 in the three tests that year, a failing grade.

# loc selects the row by its index label.
dfgrade.loc[353]


In [None]:
# Identify the name of each column.

print(dfgrade.columns)




#### COMPARING G3 TO MOTHERS' EDUCATION

In [None]:
# Use pandas to cross tabulate the final year grade G3 with mother's education. The table shows that
# only 3 students who scored above the pass grade of 8 had mothers with no education.  At the other end of the scale,
# 18 of the students who scored poorly (below 8) had mothers with a third level education.  The remaining 107 students whose
# mothers had a 3rd level education scored above 8.  The largest group of students still in school at secondary level (125)
# comes from families where the mother has achieved a 3rd level education.

pd.crosstab(dfgrade["Medu"], dfgrade["G3"], margins = True)


In [None]:
# Use seaborn to plot the data above to examine the relationship between mother's education
# and child performance in school. A count plot is a histogram across
# a categorical, instead of quantitative, variable.

import seaborn as sns

sns.countplot(x = "G3", hue="Medu", data = dfgrade);

# The plot suggests a positive association between mother's education and the child's
# G3 school grade.  The purple bar represents mothers with 3rd level education and this
# bar is most prominent among children with scores in the 14 - 20 range.


In [None]:
# Seaborn's relplot probably gives a clearer visualisation of the relationship between
# the student's final grade (G3) and his/her mother's education.  The different shades of purple
# relate to mother's education, and we can see more students with higher grades have mothers with higher education.
sns.relplot(x="G3", y = "Medu", hue = "Medu", data=dfgrade);

In [None]:
# Plotting two variables, mothers' education and G3 on the same Axis
import matplotlib.pyplot as plt
sns.distplot( dfgrade["Medu"] , color="skyblue", label="Medu")
sns.distplot( dfgrade["G3"] , color="red", label="G3")
plt.legend()
 
plt.show()

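# This plot overlays the distributions of mother's education and student grade on one axis.
# The two variables use different scales, so it compares their shapes rather than showing a correlation.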
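The strength of the relationship between mother's education and final grade can also be put into a single number with a correlation coefficient. This is a minimal sketch using pandas `DataFrame.corr` on a made-up frame; the values in `toy` are invented and only the column names match the real data.

```python
import pandas as pd

# Toy stand-in for dfgrade; invented values, not the real student data.
toy = pd.DataFrame({"Medu": [0, 1, 2, 3, 4, 4, 2, 1],
                    "G3":   [6, 8, 10, 12, 15, 14, 11, 9]})

# Pairwise Pearson correlation matrix of the two columns.
corr = toy[["Medu", "G3"]].corr()

# The off-diagonal entry is the correlation between Medu and G3.
r = corr.loc["Medu", "G3"]
print(round(r, 3))
```

Running the same two lines on dfgrade for Medu and then Fedu would give two coefficients that can be compared directly, which is a firmer basis than judging bar heights by eye.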



#### COMPARING G3 TO FATHERS' EDUCATION

In [None]:
# Use pandas to cross tabulate the final year grade G3 with father's education. The table shows that
# only 2 students who scored above the pass grade of 8 had fathers with no education.  At the other end of the scale,
# 11 of the students who scored poorly (below 8) had fathers with a third level education.  The remaining 77 students whose
# fathers were in the 3rd level category scored above 8.  The largest group of students still in school at this stage
# comes from families where the father has achieved a primary education.

# It seems therefore that mother's education is more strongly associated with children's achievement than father's education.

pd.crosstab(dfgrade["Fedu"], dfgrade["G3"], margins = True)


In [None]:
# Use seaborn to plot the data above to examine the relationship between father's education
# and child performance in school.

sns.countplot(x = "G3", hue="Fedu", data = dfgrade);

# The plot shows a medium correlation between father's education and the child's
# G3 school grade.  The purple bar represents fathers with 3rd level education and this
# bar is correlated with children in the 14 - 20 grade score.


In [None]:
# Seaborn's relplot probably gives a clearer visualisation of the relationship between
# the student's final grade (G3) and his/her father's education.  The different shades of purple
# relate to father's education, and we can see students with higher grades have fathers with higher education.
# The correlation is not as close between Fedu and G3 as it was with Medu and G3.

sns.relplot(x="G3", y = "Fedu", hue = "Fedu", data=dfgrade);

In [None]:
# Plotting fathers' education and G3 on the same Axis
sns.distplot( dfgrade["Fedu"] , color="skyblue", label="Fedu")
sns.distplot( dfgrade["G3"] , color="red", label="G3")
plt.legend()
 
plt.show()

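# This plot overlays the distributions of father's education and grade score on one axis.
# The two variables use different scales, so it compares their shapes rather than showing a correlation.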



#### COMPARING G3 AND ABSENCES

In [None]:
# Show the cross tabulation of the G3 score with the number of days
# the student was absent from school.

pd.crosstab(dfgrade["absences"], dfgrade["G3"], margins = True)
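A cross tabulation of every absence count against every grade gives a very wide, sparse table. One way to make it easier to read is to group the absences into bands first. The sketch below uses `pd.cut` on a made-up frame; the band edges, the split of grades at 10 and the values in `toy` are all assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for dfgrade; invented values, not the real student data.
toy = pd.DataFrame({"absences": [0, 2, 4, 7, 12, 25, 3, 30],
                    "G3":       [15, 14, 12, 11, 9, 6, 16, 8]})

# Hypothetical absence bands: 0-5, 6-10, 11-20 and more than 20 days.
bands = pd.cut(toy["absences"], bins=[-1, 5, 10, 20, 100],
               labels=["0-5", "6-10", "11-20", "21+"])

# Split the grades at 10 purely for illustration.
grade_group = toy["G3"].ge(10).map({True: "10 or more", False: "below 10"})

# Count the students in each combination of absence band and grade group.
table = pd.crosstab(bands, grade_group)
print(table)
```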


In [None]:
# Using seaborn barplot to show the mean number of absent days for each G3 grade
# (ci = None hides the error bars).  The barchart suggests that students
# who missed 20+ days of school were mainly represented in the lower scores
# and students with 5 or fewer absence days are highly represented
# in the 14+ grade point scores.

sns.barplot(x = "G3", y="absences", ci = None, data = dfgrade);


In [None]:
# Plotting absences and G3 on the same Axis
sns.distplot( dfgrade["absences"] , color="skyblue", label="absences")
sns.distplot( dfgrade["G3"] , color="red", label="G3")
plt.legend()
 
plt.show()

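# This plot overlays the distributions of absences and grade score on one axis.
# It compares their shapes; it does not by itself show a correlation between fewer absences and higher grades.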



#### COMPARING G3 TO 3RD LEVEL INTENTIONS

In [None]:
# Crosstabulating final grade, G3, with intention to go to 3rd Level.

# This data cannot be said to be useful in predicting grades
# for students in maths.  Because this is a secondary school most
# students attending would intend to study at third level.  It seems
# that in the Portuguese educational system it is normal to leave
# at 15 if you are not planning 3rd level.  Therefore it is safe to
# assume that those who go to secondary want to go to third level even
# if their grades do not make that a likely outcome.  I have discontinued
# using this variable for any further analysis.


pd.crosstab(dfgrade["higher"], dfgrade["G3"], margins = True)




#### PLOTTING VARIABLES TOGETHER

In [None]:
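# Plot the four variables together.  sharex is set to False because the variables have different ranges
# (education levels run from 0 to 4, absences run much higher).
f, axes = plt.subplots(2, 2, figsize=(7, 7), sharex=False)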
sns.distplot( dfgrade["Medu"] , color="skyblue", ax=axes[0, 0])
sns.distplot( dfgrade["Fedu"] , color="olive", ax=axes[0, 1])
sns.distplot( dfgrade["G3"] , color="gold", ax=axes[1, 0])
sns.distplot( dfgrade["absences"] , color="teal", ax=axes[1, 1])

# These plots display that the mothers in the sample have the wider differences in education levels, the father's education
# levels are concentrated more.  The third plot shows that the grades attained by the sample average around 10/11.  The fourth
# plot shows that the majority of students were absent for 10 days or less.


#### PLOTTING THE FOUR VARIABLES SEPARATELY

In [None]:
# Creating a histogram to illustrate mothers' educational attainment.

from pandas import DataFrame

Medu = dfgrade.iloc[:,[0]]
Medu.hist(bins = 20)
plt.title("Mothers' Education")
plt.xlabel("Education Level")
plt.ylabel("Number of Mothers/Frequency")




In [None]:
# Creating a histogram to illustrate fathers' educational attainment.

import numpy as np

Fedu = dfgrade.iloc[:,[1]]
                    
Fedu.hist(bins = 20)
                    
plt.title("Fathers' Education")
plt.xlabel("Education Level")
plt.ylabel("Number of Fathers/Frequency")



In [None]:
# Creating a histogram to illustrate absenteeism in the real student group.

absences = dfgrade.iloc[:,[3]]
#print(absences)

absences.hist(bins = 20)
plt.title("Absences")
plt.xlabel("Days absent")
plt.ylabel("Number of Students")



In [None]:
# Creating a histogram to illustrate final year grades (G3).

G3 = dfgrade.iloc[:,[6]]
G3.hist(bins = 20)
plt.title("Final Grade")
plt.xlabel("Score")
plt.ylabel("Number of Students/Frequency")



#### Projection
These histograms will be useful to see if the simulated data I create is representative of the original real world
data.  I will compare each with the histograms produced on the simulated data.



### CREATE DATASETS WHICH SIMULATE THE FOUR VARIABLES (MEDU, FEDU, ABSENCES AND G3) - Objective 3

#### EXTRA SOURCES

https://numpy.org/doc/stable/reference/generated/numpy.histogram.html
Useful to check correct syntax.

https://datatofish.com/list-to-dataframe/
I realised I had to change lists to dataframes so I could compare/contrast.

https://stackoverflow.com/users/959876/moldovean
A very handy guide to loc and iloc to isolate a column.

https://scipy-lectures.org/packages/scikit-learn/index.html
I did a lot of research on machine learning when trying to find some way to
make the simulated data similar to the real world data, this site was very informative.

https://www.tutorialspoint.com/numpy/numpy_statistical_functions.html
This site helped me adapt the function I got from the runaway horse source below to my situation.
    
https://github.com/runawayhorse001/statspy/blob/master/statspy/basics.py
This was the most useful source I found and I have been down many disappointing alleyways! It gave
me a defined function that I could apply to my real world data to create datasets which had the
same number of samples, drawn from a normal distribution with the same mean and standard deviation.  A simulated dataset which shares these properties with a real world dataset is representative of it only if the real data is itself roughly normally distributed.

https://www.youtube.com/watch?v=xlD8FIM5biA from a website called OSPY.
This site helped me to save my images correctly and import them into my notebook.

In [None]:
# Creating a dataset to simulate Medu

# Source of function 
# https://github.com/runawayhorse001/statspy/blob/master/statspy/basics.py

from scipy.stats import norm

def rnorm(n, mean=0, sd=1):
    """
    Random generation for the normal distribution with mean 
    equal to mean and standard deviation equal to sd
    same functions as rnorm in r: ``rnorm(n, mean=0, sd=1)``
    :param n: the number of the observations
    :param mean: vector of means
    :param sd: vector of standard deviations
    :return: the vector of the random numbers  
    :author: Wenqiang Feng
    :email:  von198@gmail.com
    """
    return norm.rvs(loc=mean, scale=sd, size=n)


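mu = 2.788301
SD = 1.095841
N = 359

# Pass the named parameters rather than repeating the literal values.
simulated_data = rnorm(N, mu, SD)
print(simulated_data)

# This call returns 359 samples drawn from a normal distribution with the given mean and
# standard deviation.  The sample's own mean and standard deviation will be close to these values, not identical.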
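It is worth checking how close a simulated sample actually lands to the target mean and standard deviation, and how many draws fall outside the valid Medu range of 0 to 4. The sketch below uses NumPy's `default_rng().normal` in place of `rnorm` (a swapped-in but equivalent normal sampler); the seed is arbitrary and is only there to make the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for repeatability only

mu, sd, n = 2.788301, 1.095841, 359  # Medu parameters quoted above

# Draw n values from a normal distribution with the target mean and sd.
sample = rng.normal(loc=mu, scale=sd, size=n)

# The sample statistics are estimates, so they differ slightly from mu and sd.
print("sample mean:", sample.mean())
print("sample sd:  ", sample.std(ddof=1))

# Share of draws that fall outside the valid Medu range of 0 to 4.
outside = np.mean((sample < 0) | (sample > 4))
print("share outside 0-4:", outside)
```

Any draws outside the range have no counterpart in the real data, which is one reason a plain normal sample may not match an education level that only takes the values 0 to 4.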



In [None]:
# preparing to plot the simulated Medu dataset.

a = simulated_data

hist, bins = np.histogram(a, bins = [0,1,2,3,4])
print(hist)
print(bins)

In [None]:
# A histogram to illustrate mothers' simulated education levels

a = simulated_data
plt.hist(a, bins = [0,1,2,3,4])
plt.title("Mothers' Education")
plt.xlabel("Education Level")
plt.ylabel("Number of Mothers/Frequency")

# This is not completely comparable to the real data above, where the largest group of mothers had achieved
# 3rd level education (level 4).  The bins also leave out any simulated values below 0 or above 4.
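Because education level is a category coded 0 to 4 rather than a continuous measurement, one alternative is to sample the levels directly with the proportions seen in the data. This is a sketch of that idea, not the method used in this notebook; `observed` is a made-up stand-in for dfgrade["Medu"] and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)  # arbitrary seed, for repeatability only

# Toy stand-in for dfgrade["Medu"]; invented values.
observed = pd.Series([4, 4, 3, 2, 4, 1, 2, 3, 4, 0, 2, 3])

# Relative frequency of each education level in the observed data.
props = observed.value_counts(normalize=True).sort_index()

# Draw 359 levels, each chosen with its observed probability.
simulated = rng.choice(props.index.to_numpy(), size=359, p=props.to_numpy())

# Count how often each level was drawn.
print(pd.Series(simulated).value_counts().sort_index())
```

Every simulated value is then a valid level, so the simulated histogram can be compared bar for bar with the real one.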

In [None]:
# using image from ipython to bring the histogram of the real world 
# Medu data to this position.

from IPython.display import Image
Image(filename="img/medu.png")

In [None]:
# Creating a dataset to simulate Fedu.

mu = 2.540390
SD = 1.084637
N = 359

# This call returns 359 samples drawn from a normal distribution with the given mean and standard deviation.

simulated_data1 = rnorm(N, mu, SD)
print(simulated_data1)


In [None]:
#prepare to plot the dataset of fathers' education/Fedu.

b = simulated_data1
hist, bins = np.histogram(b, bins =[0,1,2,3,4])
print(hist)
print(bins)

In [None]:
# A histogram to illustrate simulated dataset for fathers' education. 

b = simulated_data1

plt.hist(b, bins = [0, 1, 2, 3, 4])
plt.title("Fathers' Education")
plt.xlabel("Education Level")
plt.ylabel("Number of Fathers/Frequency")

# This plot is closer to the plot from the real dataset.


In [None]:
# using image from ipython to bring the histogram of the real world 
# Fedu data to this position.

Image(filename="img/fedu.png")

In [None]:
# Creating a dataset to simulate Absences

mu = 6.281337
SD = 8.178283
N = 359

# This produces a simulated data set but because the standard deviation is larger than the mean,
# there are many negative values.  I decided to square all the values then get their
# square roots to have all positive values.

# This call returns 359 samples drawn from a normal distribution with the given mean and standard deviation.
simulated_data2 = rnorm(N, mu, SD)
#print(simulated_data2)

# This code squares every element i of the array l.
l = simulated_data2
[i**2 for i in l]




In [None]:
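# l is the original simulated dataset created by the function rnorm.  l1 is the dataset created
# by squaring every element of l.
l = simulated_data2
l1 = [i**2 for i in l]

# Use the numpy sqrt function to get the square root of each element of l1, call the new array l2.
# Squaring and then taking the square root gives the absolute value of each element (the same as np.abs(l)),
# so negative values are reflected to positive ones and the mean of l2 is higher than mu.
l2 = np.sqrt(l1)
#print(l2)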

In [None]:
# Prepare to plot the absence data.

hist, bins = np.histogram(l2)
print(hist)
print(bins)

In [None]:
# Creating a histogram to illustrate absenteeism in the simulated student group.

plt.hist(l2)
plt.title("Absences")
plt.xlabel("Absent Days")
plt.ylabel("Number of Students")

# This dataset is still not correct and does not resemble the real data.
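Absences are counts that cannot be negative, and their standard deviation is larger than their mean, which a normal distribution handles poorly. One possible alternative, offered here only as a sketch and not as the method used in this notebook, is a negative binomial distribution with its two parameters chosen to match the quoted mean and standard deviation (a method-of-moments fit). The names `r`, `p` and `absences_sim` are introduced for illustration and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed, for repeatability only

mu, sd, n_samples = 6.281337, 8.178283, 359  # absences parameters quoted above
var = sd ** 2

# Method-of-moments parameters; this only works because var is greater than mu.
p = mu / var
r = mu ** 2 / (var - mu)

# Draw non-negative integer counts with (approximately) the target mean and sd.
absences_sim = rng.negative_binomial(r, p, size=n_samples)

print("min:", absences_sim.min())
print("mean:", absences_sim.mean())
print("sd:", absences_sim.std(ddof=1))
```

The draws are whole numbers from zero upwards with a long right tail, so no squaring or square roots are needed. Whether the shape matches the real absences would still have to be checked against the real histogram.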

In [None]:
# using image from ipython to bring the histogram of the real world 
# Absences data to this position.

Image(filename="img/absences.png")

In [None]:
# Creating a dataset to simulate G3

mu = 11.45961
SD = 3.33140
N = 359

simulated_data3 = rnorm(N, mu, SD)
print(simulated_data3)



In [None]:
# Preparing data to plot.
d = simulated_data3

hist, bins = np.histogram(d, bins = 20 )
print(hist)
print(bins)

In [None]:
# Creating a histogram to illustrate G3, the final year score in the simulated student group.

d = simulated_data3
plt.hist(d)
plt.title("G3")
plt.xlabel("Exam Score")
plt.xticks(range(1,21))
plt.ylabel("Number of Students/Frequency")


# This dataset resembles the real dataset for student final exam scores, G3, quite well.

In [None]:
# Using image from ipython to bring the histogram of the real world 
# G3 data to this position.

Image(filename="img/g3.png")

### Results

1. I sourced a suitable dataset of student attributes across a variety of variables.
2. I imported the dataset into this jupyter notebook and isolated 359 samples across
    seven variables.  I eventually narrowed this down to four variables.
3. I analysed the data using crosstabulation and various plots.  I finally made histogram plots
    of each of four variables.
4. The final task was to create simulated data which shared characteristics with the real world
    data.  I used the function rnorm so I could base the simulated data on the real data.
5. I made histogram plots to illustrate the simulated datasets.  I then used Image to show the real
    world plots beside the simulated data plots.
  

### Conclusion

It is clear that the simulated data doesn't resemble the real world data as much as I would expect, 
especially the Absences dataset.  I am not entirely happy with the outcome but I have certainly 
learned a lot more, particularly in the area of research.
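One further limitation is that each variable was simulated on its own, so the simulated columns carry none of the relationships explored under Objective 2. A possible next step, sketched here under stated assumptions rather than carried out on the real data, is to draw all the columns together from a multivariate normal distribution built from the mean vector and covariance matrix of the data. The frame `toy` holds invented values and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)  # arbitrary seed, for repeatability only

# Toy stand-in for dfgrade; invented values, not the real student data.
toy = pd.DataFrame({"Medu": [0, 1, 2, 3, 4, 4, 2, 1],
                    "absences": [20, 12, 6, 4, 2, 0, 8, 10],
                    "G3": [6, 8, 10, 12, 15, 14, 11, 9]})

# Draw 359 rows that share the means and covariances of the toy data.
sim = rng.multivariate_normal(toy.mean().to_numpy(),
                              toy.cov().to_numpy(), size=359)
sim = pd.DataFrame(sim, columns=toy.columns)

# Compare the correlation matrices of the original and simulated frames.
print(toy.corr().round(2))
print(sim.corr().round(2))
```

The simulated values are continuous and can still be negative, so they would need rounding or a different marginal distribution before they could stand in for education levels or absence counts.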