Moving forward, we will use Google Colab to run our code. You can access Google Colab at this link: https://colab.research.google.com/

# Statistics and Data Visualizations

For this lesson we will need the following packages: pandas, matplotlib, seaborn, scipy, and numpy.

## Section 1

### Basic summary statistics

Basic summary statistics provide an overview of the central tendency, dispersion, and shape of a variable.
Python's pandas library offers a convenient method called `describe()` to generate summary statistics for a DataFrame.

In this example, we import the pandas library and load the data from a CSV file into a DataFrame named `df_students`. The `describe()`
method is then called on the DataFrame to calculate summary statistics such as count, mean, standard deviation, minimum, quartiles,
and maximum for each numerical column in the data. The result is stored in the `summary` variable and printed.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


# Load the data into a DataFrame
df_students = pd.read_csv('students_performance.csv')

# Generate summary statistics
summary = df_students.describe()
print(summary)
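
To make the layout of the `describe()` output more concrete, here is a minimal sketch on a made-up miniature data set. The DataFrame `df_example` and its values are hypothetical and are not taken from `students_performance.csv`; the sketch only illustrates how individual statistics can be read out of the summary.

```python
import pandas as pd

# Hypothetical miniature data set; the values are made up for illustration
df_example = pd.DataFrame({
    'math score': [60, 70, 80, 90],
    'lunch': ['standard', 'free/reduced', 'standard', 'standard'],
})

# describe() on a mixed DataFrame summarizes only the numeric columns
summary_example = df_example.describe()
print(summary_example)

# Individual statistics can be looked up by row label and column name
mean_math = summary_example.loc['mean', 'math score']
median_math = summary_example.loc['50%', 'math score']
print(f"Mean: {mean_math}, median: {median_math}")

# Calling describe() on a text column reports count, unique, top, and freq
lunch_summary = df_example['lunch'].describe()
print(lunch_summary)
```

The `50%` row is the median, and `top` and `freq` give the most common category and how often it appears.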

### Qualitative vs Quantitative data

Data can be either qualitative or quantitative.

Qualitative data refers to data that can usually be divided into groups. For example, data about marital status can be sorted into unmarried, married, or engaged. Qualitative data can have categories with no order (nominal variables), such as the marital status example, or categories that can be ordered (ordinal variables), such as a person's generation (e.g., Gen Z, Millennial).

Quantitative data are numerical variables, representing data that can be measured or counted and are typically expressed as numbers. Examples include the weight of a bar or the number of people who finished a race. Quantitative data can be separated further into discrete and continuous variables. Discrete variables usually come from counting and take separate, countable values. Examples include how many siblings you have or the number of cars someone sells in a month. Continuous variables can take on any value within a range, including fractional or decimal values. Examples include height, age, and temperature.
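
As a rough first pass, pandas can separate numeric columns from non-numeric ones. The sketch below uses a hypothetical DataFrame `df_example` with made-up column names. It is only a starting point: a numeric column is not always quantitative (an ID number, for instance, is really a label), so the result still needs a human check.

```python
import pandas as pd

# Hypothetical data set with one qualitative and two quantitative columns
df_example = pd.DataFrame({
    'lunch': ['standard', 'free/reduced', 'standard'],   # qualitative (nominal)
    'siblings': [0, 2, 1],                               # quantitative (discrete)
    'height_cm': [160.5, 172.0, 168.3],                  # quantitative (continuous)
})

# Columns with a numeric dtype are candidates for quantitative data
quantitative_columns = df_example.select_dtypes(include='number').columns.tolist()

# Everything else is a candidate for qualitative data
qualitative_columns = df_example.select_dtypes(exclude='number').columns.tolist()

print(f"Quantitative: {quantitative_columns}")
print(f"Qualitative: {qualitative_columns}")
```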

<span style ="background-color:yellow">
TODO: Print out the statistical summary for your data set and interpret what the values represent to your classmates. Please write down which columns are qualitative and which ones are quantitative.
</span>

### Counting Missing Values and Identifying Outliers

Missing values and outliers are important considerations in data analysis. Python's pandas library provides functions to count missing values and identify outliers.

To count missing values:

In [None]:
# Count missing values in each column
missing_values = df_students.isnull().sum()
print(missing_values)

The `isnull()` function checks each element in the DataFrame and returns `True` if it is missing (`NaN`) and `False` otherwise. By calling `sum()` on the result, we obtain the count of missing values for each column.

Since some missing values were detected, let's explore how to clean our data. For quantitative data, you might fill in the average for a column. For example, let's populate the missing math and reading scores with the average values for those columns.

In [None]:
# Find the average math score
math_average = df_students['math score'].mean()

# Fill empty cells with the average and assign the result back to the column
df_students['math score'] = df_students['math score'].fillna(math_average)

# Find the average reading score
reading_average = df_students['reading score'].mean()

# Fill empty cells with the average and assign the result back to the column
df_students['reading score'] = df_students['reading score'].fillna(reading_average)

What if our data is categorical/qualitative? Let's fill those values with the most frequent response or the 'mode', and apply that to the lunch column.

In [None]:
# Calculate the mode for the specific column
lunch_mode = df_students['lunch'].mode()[0]

print(f"This is the most common lunch '{lunch_mode}'")

# Fill missing values in the specific column with the mode and assign the result back
df_students['lunch'] = df_students['lunch'].fillna(lunch_mode)


Let's check back in to see if we filled in some of those missing values.

In [None]:
# Count missing values in each column
missing_values = df_students.isnull().sum()
print(missing_values)

## Section 2

### Normal & Bimodal Distribution

Let's observe frequencies and relative frequencies of data in our dataframe. It can be helpful to contextualize the type of data we have and the different categories it can be split up into. Furthermore, it can help us choose the best way to visualize the data.

In [None]:
data = df_students['lunch']  # Example data

frequency_table = {}

for value in data:
    if value in frequency_table:
        frequency_table[value] += 1
    else:
        frequency_table[value] = 1

print(f"Frequency table: {frequency_table}")


data = df_students['lunch']  # Example data

relative_frequency_table = {}

total_values = len(data)

for value in data:
    if value in relative_frequency_table:
        relative_frequency_table[value] += 1 / total_values
    else:
        relative_frequency_table[value] = 1 / total_values

print(f"Relative frequency table: {relative_frequency_table}")
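
The loops above spell out the counting by hand. As a sketch of a shorter route, pandas' `value_counts()` produces the same kind of table in one call. The Series `lunch_example` below is hypothetical, made-up data. Note that `value_counts()` skips missing values by default.

```python
import pandas as pd

# Hypothetical lunch column; the values are made up for illustration
lunch_example = pd.Series(['standard', 'free/reduced', 'standard', 'standard'])

# Frequency table: how many times each value appears
frequency = lunch_example.value_counts()
print(frequency)

# Relative frequency table: the share of rows taken by each value
relative_frequency = lunch_example.value_counts(normalize=True)
print(relative_frequency)
```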

Let's explore what a normal distribution looks like using a histogram. A histogram groups quantitative data into bins and shows how many values fall into each bin, which gives a visual representation of frequencies much like the frequency tables in the prior cells.

In this example, we generate 1000 random samples from a normal distribution with a mean of 0 and a standard deviation of 1 using `np.random.normal()`. The generated samples are then plotted as a histogram using
`plt.hist()`. The `bins` parameter specifies the number of bins for the histogram, `density=True` normalizes the histogram so that the areas of the bars sum to 1, `alpha` controls the transparency of the bars, and `color` sets the color of the bars to blue.

Finally, the code sets the x-axis label to `"Value"`, the y-axis label to `"Probability Density"`, and the title of the plot to `"Normal Distribution"` using `plt.xlabel()`, `plt.ylabel()`, and `plt.title()` respectively.

The `np.linspace()` function generates 100 evenly spaced x values between -5 and 5. The probability density function is evaluated at these values and drawn as a red line.

Running this code will display a plot with a histogram representing the generated random samples from the normal distribution.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Set the parameters for the normal distribution
mean = 0
std_dev = 1

# Generate random samples from a normal distribution
samples = np.random.normal(mean, std_dev, 1000)

# Plot the histogram of the samples
plt.hist(samples, bins=30, density=True, alpha=0.7, color='blue')

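# Evaluate the normal probability density function at 100 evenly spaced points
# and draw it as a red line on top of the histogram
x = np.linspace(-5, 5, 100)
pdf = (1 / (std_dev * np.sqrt(2 * np.pi))) * np.exp(-(x - mean)**2 / (2 * std_dev**2))
plt.plot(x, pdf, color='red', linewidth=2)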

# Set plot labels and title
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.title('Normal Distribution')

# Show the plot
plt.show()
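
For reference, the expression assigned to `pdf` in the cell above is the probability density function of the normal distribution. The symbols $\mu$ and $\sigma$ are notation introduced here for the `mean` and `std_dev` variables in the code.

```latex
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```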

To create an example of bimodal data, we can generate random samples from a mixture of two Gaussian (bell-shaped) distributions. Here's the modified code:

In [None]:
# Generate random samples from a mixture of two Gaussian distributions
mean1 = -2
std_dev1 = 1
samples1 = np.random.normal(mean1, std_dev1, 500)

mean2 = 2
std_dev2 = 0.5
samples2 = np.random.normal(mean2, std_dev2, 500)

samples = np.concatenate((samples1, samples2))

# Plot the histogram of the samples
plt.hist(samples, bins=30, density=True, alpha=0.7, color='blue')

# Set plot labels and title
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.title('Bimodal Data Example')

# Show the plot
plt.show()

A distribution curve shows how a data set looks when plotted on a graph. Distributions can be skewed to the right or the left, or take a variety of other shapes. In the examples above, 1000 random samples are drawn and plotted as a histogram: once from a single normal distribution and once from a mixture of two.

## Section 3

### Skews

Understanding the distribution of variables helps in assessing their shape, skewness, and potential data transformations. Python offers several visualization libraries to examine variable distributions. Let's use Matplotlib to create a histogram and Seaborn to generate a box plot.

In [None]:
# Plot a histogram of the math scores
plt.hist(df_students['math score'], bins=20)

# We use the .xlabel() function to give the x-axis a label
plt.xlabel('Math score')

# We use the .ylabel() function to give the y-axis a label
plt.ylabel('Frequency')

# We use the .title() function to give the histogram a title
plt.title('Histogram of Math Scores')

# Display the plot
plt.show()

The histogram above is slightly left-skewed. In left-skewed data, the low values pull the mean down, so the mean tends to be smaller than the median. Let's check whether the mean is smaller than the median, here using the reading scores.

In [None]:
reading_median = df_students['reading score'].median()
print(f"The median reading score is: {reading_median}")
df_students['reading score'].describe()

Since the skew was quite small, you can see that the mean is only a bit smaller than the median.
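
The relationship between skew, mean, and median can also be checked numerically. The sketch below uses a hypothetical list of scores with one unusually low value; the numbers are made up and are not from the student data set. pandas' `.skew()` returns a negative number for left-skewed data and a positive number for right-skewed data.

```python
import pandas as pd

# Hypothetical scores with one low value that creates a left skew
scores_example = pd.Series([35, 60, 65, 70, 72, 75, 78, 80, 82, 85])

mean_score = scores_example.mean()
median_score = scores_example.median()
skewness = scores_example.skew()

print(f"Mean: {mean_score}")
print(f"Median: {median_score}")
print(f"Skewness: {skewness:.2f}")

# For left-skewed data the mean sits below the median
print(f"Mean is smaller than median: {mean_score < median_score}")
```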

## Section 4

### Standard Deviation/Statistic calculations

Learning where the statistical calculations come from is important for understanding how they relate to each other
and what their significance is.

In the example below, we take the mean math score using the `.mean()` function. Using the mean, we work toward the variance of the math scores: the difference between each data point and the mean (its deviation) is squared, and the variance of the column is the average of these squared deviations. The `.std()` function returns the standard deviation of the column, which is the square root of the variance.
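
Written as formulas, the sample variance and sample standard deviation look as follows. The symbols are notation introduced here: $x_i$ is an individual math score, $\bar{x}$ is the mean, and $N$ is the number of scores. Dividing by $N - 1$ matches the default behavior of pandas' `.std()`; for an entire population the divisor would be $N$ instead.

```latex
s^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2,
\qquad
s = \sqrt{s^2}
```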

To find the z-score, a measure of how many standard deviations a data point is from the mean, we subtract the mean from the data point and divide the result by the standard deviation.

In [None]:
# Assuming 'math score' is a column in your DataFrame
# Calculate mean math score
mean_math_score = df_students['math score'].mean()

# Create a new column 'math_score_variance' holding the squared difference between each math score
# and the mean. Averaging these squared differences gives the variance of the column.
df_students['math_score_variance'] = (df_students['math score'] - mean_math_score)**2

# Calculate the standard deviation of all the math scores. This is done by summing all of the squares from
# above, dividing by either the number of records (N) for an entire population or the number of
# records minus 1 (N-1) if you only have a sample, and then taking the square root.
# pandas' .std() divides by N-1 by default.
std_math_score = df_students['math score'].std()

# Create a new column 'math_score_std_dev' holding the number of standard deviations each math score
# is from the mean. This is often called a z-score. Knowing the z-score for a record tells us how far
# a value is from the mean.
df_students['math_score_std_dev'] = (df_students['math score'] - mean_math_score) / std_math_score

df_students[['math_score_std_dev', 'math score', 'math_score_variance']]

As is normally the case for programming, there is a module that can calculate things much more easily for you. Here we will use the scipy package to calculate the z score in a single line of code, then cross-reference the results with our above dataframe data.

In [None]:
from scipy import stats

In [None]:
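# Calculate z-scores for each value in a column. The nan_policy argument tells this function to omit any null (NaN) values in this column from the calculation.
# Note: stats.zscore() divides by N by default (ddof=0), while pandas' .std() divides by N-1, so these values differ slightly from the manual calculation above.
df_students['math_score_std_dev'] = stats.zscore(df_students['math score'], nan_policy='omit')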

df_students[['math_score_std_dev']]
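
When cross-referencing the two approaches, it helps to know that they use different divisors by default: pandas' `.std()` divides by N-1, while `stats.zscore()` divides by N unless `ddof=1` is passed. The sketch below shows this on a hypothetical, made-up list of scores rather than the student data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical scores; the values are made up for illustration
scores_example = pd.Series([50.0, 60.0, 70.0, 80.0, 90.0])

# Manual z-scores with pandas; .std() divides by N - 1 by default
manual_z = (scores_example - scores_example.mean()) / scores_example.std()

# scipy's default divides by N (ddof=0), so the values come out slightly larger in size
scipy_z_default = stats.zscore(scores_example.to_numpy())

# Passing ddof=1 makes scipy divide by N - 1, matching the manual calculation
scipy_z_sample = stats.zscore(scores_example.to_numpy(), ddof=1)

print(np.allclose(manual_z, scipy_z_default))  # False
print(np.allclose(manual_z, scipy_z_sample))   # True
```

On a data set with many rows the difference between the two divisors is small, but it is enough to make an exact comparison fail.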

## Section 5

### Visualizing Categorical Data

This code creates a bar plot using the Seaborn library in Python. Here we pass three arguments to `sns.barplot()`:

- x: The column name of the categorical variable to be plotted on the x-axis.
- y: The column name of the continuous variable to be plotted on the y-axis.
- data: The name of the DataFrame that contains the data to be plotted.

In this case, the x argument is set to "lunch", which is a categorical variable that represents whether a student receives free/reduced lunch or not. The y argument is set to "math score", which is a continuous variable that represents a student’s math score. Finally, the data argument is set to `df_students`, which is the name of the DataFrame that contains the data.

The resulting plot will show the average math score for students who receive free/reduced lunch and those who do not.

### Bar plots

In [None]:
# Create a bar plot
sns.barplot(x="lunch", y="math score", data=df_students)
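
The bar heights in this kind of plot are group averages, which can be computed directly with `groupby()`. The sketch below uses a hypothetical DataFrame `df_example` with made-up scores to show the calculation behind the bars.

```python
import pandas as pd

# Hypothetical data; the scores are made up for illustration
df_example = pd.DataFrame({
    'lunch': ['standard', 'free/reduced', 'standard', 'free/reduced'],
    'math score': [70, 55, 80, 65],
})

# Average math score for each lunch group, one value per bar
mean_by_lunch = df_example.groupby('lunch')['math score'].mean()
print(mean_by_lunch)
```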

### Side-by-Side Bar Chart

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Average math and reading scores for each lunch group
group_means = df_students.groupby('lunch')[['math score', 'reading score']].mean()
categories = group_means.index
values1 = group_means['math score']
values2 = group_means['reading score']

# Set the positions of the bars on the x-axis
bar_width = 0.35
bar_positions1 = np.arange(len(categories))
bar_positions2 = bar_positions1 + bar_width

# Create the figure and axes
fig, ax = plt.subplots()

# Plot the bars for the average math scores
ax.bar(bar_positions1, values1, width=bar_width, label='Math score')

# Plot the bars for the average reading scores
ax.bar(bar_positions2, values2, width=bar_width, label='Reading score')


# Add labels, title, and legend
ax.set_xlabel('Lunch')
ax.set_ylabel('Average score')
ax.set_title('Side-by-Side Bar Chart')
ax.set_xticks(bar_positions1 + bar_width / 2)
ax.set_xticklabels(categories)
ax.legend()

# Display the chart
plt.show()


### Pie Chart

This code creates a pie chart using the Matplotlib library in Python. The `value_counts()` function is used to count the number of occurrences of each unique value in the "race/ethnicity" column of the DataFrame `df_students`. The resulting counts are stored in the `category_counts` variable.

Here we pass three arguments to `plt.pie()`:

- x: The values to be plotted. In this case, it is the values of `category_counts`.
- labels: The labels for each value. In this case, it is the index of `category_counts`.
- autopct: A string or function used to format the percentage values.

The resulting pie chart will show the distribution of race/ethnicity in the DataFrame.

In [None]:
# Create a pie chart
category_counts = df_students['race/ethnicity'].value_counts()
plt.pie(category_counts.values, labels=category_counts.index, autopct='%1.1f%%')
plt.title('Distribution of Race/Ethnicity')
plt.show()


## Section 6

### Visualizing Quantitative Data

### Box-and-Whisker Plot

In this example, we use `sns.boxplot()` from the seaborn library to create a box plot of the math scores in `df_students`. The `x` parameter is set to the column of interest. `plt.xlabel()` and `plt.title()` label the x-axis and give the plot a title. Finally, `plt.show()` displays the box plot. The individual points drawn to the left of the whisker can be categorized as outliers.

In [None]:
# Generate a box plot of the math scores
sns.boxplot(x=df_students['math score'])
plt.xlabel('Math score')
plt.title('Box Plot of Math Scores')
plt.show()
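
The points drawn beyond the whiskers follow a common rule of thumb: by default, seaborn extends each whisker to the furthest data point within 1.5 times the interquartile range (IQR) of the box and draws anything beyond that as an individual point. The sketch below applies the same rule by hand to a hypothetical, made-up list of scores.

```python
import pandas as pd

# Hypothetical scores with one unusually low value
scores_example = pd.Series([8, 52, 55, 60, 62, 65, 68, 70, 75, 78])

# The box spans the first quartile (Q1) to the third quartile (Q3)
q1 = scores_example.quantile(0.25)
q3 = scores_example.quantile(0.75)
iqr = q3 - q1

# Values further than 1.5 * IQR from the box are flagged as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = scores_example[(scores_example < lower_limit) | (scores_example > upper_limit)]

print(f"Q1: {q1}, Q3: {q3}, IQR: {iqr}")
print(f"Limits: {lower_limit} to {upper_limit}")
print(f"Outliers: {outliers.tolist()}")
```

Whether a flagged value is a real error or a genuine extreme result is a judgment call that depends on the data.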

### Scatterplot

Here we pass three arguments to `sns.scatterplot()`:

- data: The name of the DataFrame that contains the data to be plotted.
- x: The column name of the variable to be plotted on the x-axis.
- y: The column name of the variable to be plotted on the y-axis.

In this case, the `data` argument is set to `df_students`, which is the name of the DataFrame that contains the data. The `x` argument is set to `"math score"`, which is a variable that represents the student's math score. The `y` argument is set to `"reading score"`, which is another variable that represents the student's reading score.

In [None]:
# Create a scatter plot of math scores against reading scores
sns.scatterplot(data = df_students, x = "math score", y = "reading score")
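
A scatterplot shows visually whether two variables move together. As an optional companion, the strength of a straight-line relationship can be summarized with a single number using pandas' `.corr()`, which computes the Pearson correlation coefficient by default. The DataFrame `df_example` below is hypothetical, with made-up scores.

```python
import pandas as pd

# Hypothetical scores; the values are made up for illustration
df_example = pd.DataFrame({
    'math score': [50, 60, 70, 80, 90],
    'reading score': [55, 62, 68, 83, 88],
})

# Pearson correlation: close to 1 means the points rise together along a line,
# close to -1 means one falls as the other rises, close to 0 means no linear pattern
correlation = df_example['math score'].corr(df_example['reading score'])
print(f"Correlation: {correlation:.2f}")
```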

<span style ="background-color:yellow">
TODO: Using any of the graphs above, spend the next 5 minutes collecting, aggregating, and sorting your data. Then gather the necessary inputs and create the graph most useful for a reader trying to interpret your data. Afterwards, have a 5-minute discussion going over every step of the process with another student.
</span>