# Dataframes and Data Visualization Part 2

For this lesson we will need the following packages: pandas, matplotlib.pyplot, seaborn, scipy, and numpy.

## Section 1

### Basic summary statistics

Basic summary statistics provide an overview of the central tendency, dispersion, and shape of a variable.
Python's pandas library offers a convenient method called `describe()` to generate summary statistics for a DataFrame.

In this example, we import the pandas library and load the data from a CSV file into a DataFrame named `df_students`. The `describe()`
method is then called on the DataFrame to calculate summary statistics such as count, mean, standard deviation, minimum, quartiles,
and maximum for each numerical column in the data. The result is stored in the `summary` variable and printed.

In [None]:
import pandas as pd

# Load the data into a DataFrame
df_students = pd.read_csv('students_performance.csv')

# Generate summary statistics
summary = df_students.describe()
print(summary)
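
As a minimal sketch of how to read individual values out of the summary, the snippet below builds a small made-up DataFrame (the `demo` name and its scores are assumptions for illustration, not part of the lesson's data set) and looks up single statistics by row and column label.

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({"math score": [60, 70, 80, 90],
                     "reading score": [65, 75, 85, 95]})

summary = demo.describe()

# Rows of the summary are statistics; columns are the numeric columns
mean_math = summary.loc["mean", "math score"]
median_reading = summary.loc["50%", "reading score"]  # the 50% quartile is the median

print(mean_math)
print(median_reading)
```

The same `.loc[row, column]` lookup should work on the summary of your own data set, as long as you use your own column names.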

<span style ="background-color:yellow">
TODO: Print out the statistical summary for your data set and explain to your classmates what the values represent. Based on this data, what are some expectations you have for the shape of your data?
</span>

### Counting Missing Values and Identifying Outliers

Missing values and outliers are important considerations in data analysis. Python's pandas library provides functions to count missing values and identify outliers.

To count missing values:

In [None]:
# Count missing values in each column
missing_values = df_students.isnull().sum()
print(missing_values)

The `isnull()` method checks each element in the DataFrame and returns `True` if it is missing (`NaN`) and `False` otherwise. By calling `sum()` on the result, we obtain the count of missing values for each column.

Since some missing values were detected, we can fill them in. One option is to use an aggregate function such as the mean of the other values: for example, filling the null math scores with the mean math score. Another option, used in the next cell, is to replace missing values with a placeholder such as `'Unknown'`.

In [None]:
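import numpy as np

# Convert the DataFrame to a NumPy array (missing values appear as NaN)
data = df_students.to_numpy()

# Replace missing values with a placeholder. pd.isna() detects both NaN and None,
# whereas comparing with `== None` would miss NaN values.
filled_array = np.where(pd.isna(data), 'Unknown', data)

print(filled_array)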
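
The mean-based fill mentioned above can be sketched as follows. This is a minimal example on made-up data (the `demo` DataFrame and its values are assumptions for illustration); it uses the pandas `fillna()` method rather than NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value
demo = pd.DataFrame({"math score": [60.0, 80.0, np.nan, 100.0]})

# .mean() skips NaN by default: (60 + 80 + 100) / 3 = 80
mean_math = demo["math score"].mean()

# Replace the missing value with the mean of the other values
demo["math score"] = demo["math score"].fillna(mean_math)

print(demo)
```

Filling with the mean keeps the column's mean unchanged, but it does reduce the spread of the data, so it is worth noting how many values were filled.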

## Section 2

### Outliers and Shape

Sometimes, data may have outliers. An outlier is a data point that differs significantly from other data points. Its existence may skew our data and distort other descriptive statistics to the point where any conclusions we draw from them are not accurate. We can identify outliers using the z-score method. A z-score is a statistical measurement that describes how many standard deviations a data point is from the mean of the data. We can use a z-score as a cutoff point for reasonable data. In this example, we will use 3 as the cutoff.

To identify outliers using the z-score method, check whether any value lies more than 3 standard deviations from the mean.

In this example, we use the `zscore()` function from the `scipy.stats` module to calculate the z-scores for each value in a specific column, where `column_name` represents the column of interest. We use the `abs()` function to take the absolute values of the z-scores, then define a threshold (e.g., 3): values whose absolute z-score exceeds the threshold are considered outliers. Finally, we create a DataFrame named `outliers` containing the rows with identified outliers and print it.
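
A minimal sketch of the steps just described, run on made-up data (the `demo` DataFrame, its `score` column, and the values in it are assumptions for illustration; swap in your own DataFrame and column name):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: 19 typical scores and one extreme value
demo = pd.DataFrame({"score": [70] * 10 + [72] * 9 + [150]})
column_name = "score"

# Absolute z-score of every value in the column
z_scores = np.abs(stats.zscore(demo[column_name]))

# Rows whose z-score exceeds the cutoff are treated as outliers
threshold = 3
outliers = demo[z_scores > threshold]

print(outliers)
```

Note that in very small data sets a single extreme value inflates the standard deviation so much that its own z-score may never reach 3, so this method works best with a reasonable number of rows.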

<span style ="background-color:yellow">
TODO: Examine your data and find out if there are any outliers or null values. Next, create a boxplot and a histogram for your dataset. Once completed, talk with another student about how any outliers or null values may have impacted your graphs. Then discuss any observations you made about your graphs and anything that stood out to you.
</span>

## Section 3

### Normal Distribution and Bimodal Distribution

Here's an example of generating a normal distribution using NumPy and plotting it with matplotlib:

In this example, we generate 1000 random samples from a normal distribution with a mean of 0 and a standard deviation of 1 using `np.random.normal()`. The generated samples are then plotted as a histogram using `plt.hist()`. The `bins` parameter specifies the number of bins for the histogram, `density=True` ensures that the histogram is normalized, `alpha` controls the transparency of the bars, and `color` sets the color of the bars to blue.

Finally, the code sets the x-axis label to `"Value"`, the y-axis label to `"Probability Density"`, and the title of the plot to `"Normal Distribution"` using `plt.xlabel()`, `plt.ylabel()`, and `plt.title()` respectively.

Running this code will display a plot with a histogram representing the generated random samples from the normal distribution.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Set the parameters for the normal distribution
mean = 0
std_dev = 1

# Generate random samples from a normal distribution
samples = np.random.normal(mean, std_dev, 1000)

# Plot the histogram of the samples
plt.hist(samples, bins=30, density=True, alpha=0.7, color='blue')

# Set plot labels and title
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.title('Normal Distribution')

# Show the plot
plt.show()

To create an example of bimodal data, we can generate random samples from a mixture of two Gaussian distributions. Here's the modified code:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Generate random samples from a mixture of two Gaussian distributions
mean1 = -2
std_dev1 = 1
samples1 = np.random.normal(mean1, std_dev1, 500)

mean2 = 2
std_dev2 = 0.5
samples2 = np.random.normal(mean2, std_dev2, 500)

samples = np.concatenate((samples1, samples2))

# Plot the histogram of the samples
plt.hist(samples, bins=30, density=True, alpha=0.7, color='blue')

# Set plot labels and title
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.title('Bimodal Data Example')

# Show the plot
plt.show()

## Section 3

### Standard Deviation/Statistic calculations

Learning where the statistical calculations come from is important for understanding how they relate to each other
and what their significance is.

In the example below we find the mean math score by using the `.mean()` function. Using this we calculate the squared deviation of each math score: the difference between the data point and the mean (the residual), squared. The variance is the average of these squared deviations, and the standard deviation is its square root. We can use the `.std()` function to find the standard deviation of the whole column.

To find the z-score, a measure of how many standard deviations a data point is from the mean, we subtract the mean from the data point and divide the result by the standard deviation.

In [None]:
# Assuming 'math score' is a column in your DataFrame
# Calculate mean math score
mean_math_score = df_students['math score'].mean()

# Create a new column 'math_score_variance' holding the squared deviation of each math score: the
# square of the difference between the value and the mean. The variance is the average of these.
df_students['math_score_variance'] = (df_students['math score'] - mean_math_score)**2

# Calculate the standard deviation of all the math scores. This is done by summing all of the squares from
# above, dividing by the number of records (N) for an entire population or by the number of records
# minus 1 (N-1) for a sample, then taking the square root. pandas' .std() divides by N-1 by default.
std_math_score = df_students['math score'].std()

# Create a new column 'math_score_std_dev' holding the number of standard deviations each math score is
# from the mean. This is often called a z-score. Knowing the z-score for a record tells us how far
# a value is from the mean.
df_students['math_score_std_dev'] = (df_students['math score'] - mean_math_score) / std_math_score

df_students[['math_score_std_dev', 'math score', 'math_score_variance']]

As is normally the case in programming, there's a module that can calculate things much more easily for you. Here we will use the scipy package to calculate the z-score in a single line of code, then cross-reference the results with the DataFrame columns we computed above.

In [None]:
from scipy import stats

In [None]:
print(df_students['math score'].dtype)

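# Calculate z-scores for each value in a column. nan_policy='omit' ignores null (NaN) values when
# computing the mean and standard deviation. Note that stats.zscore uses the population standard
# deviation (ddof=0) by default, so results can differ slightly from the manual calculation above.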
df_students['math_score_std_dev'] = stats.zscore(df_students['math score'], nan_policy='omit')

df_students[['math_score_std_dev']]
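
When cross-referencing the two approaches, small differences can come from which standard deviation is used. The sketch below, on made-up scores (the `scores` Series is an assumption for illustration), compares a manual z-score based on pandas' `.std()` with `stats.zscore()` using its default settings and with `ddof=1`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical scores for illustration
scores = pd.Series([60.0, 70.0, 80.0, 90.0])

# Manual z-scores using pandas' default sample standard deviation (divides by N-1)
manual_z = (scores - scores.mean()) / scores.std()

# SciPy defaults to the population standard deviation (divides by N)
scipy_z_default = stats.zscore(scores)

# Passing ddof=1 makes SciPy use the sample standard deviation as well
scipy_z_sample = stats.zscore(scores, ddof=1)

print(np.allclose(manual_z, scipy_z_default))  # the two defaults disagree
print(np.allclose(manual_z, scipy_z_sample))   # matching ddof gives matching z-scores
```

For large data sets the difference between dividing by N and N-1 is tiny, but it is visible on a handful of rows like these.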

## Section 4

### Distribution curves

A distribution curve shows how a data set looks when plotted on a graph. Distributions can be skewed to the right or the left, or take a variety of other shapes. In the example below, 1000 random samples are drawn from a normal distribution and plotted as a histogram, with the normal probability density function drawn over it as a red curve.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Generate random samples from a normal distribution
mean = 0
std_dev = 1
samples = np.random.normal(mean, std_dev, 1000)

# Plot the histogram of the samples
plt.hist(samples, bins=30, density=True, alpha=0.7, color='blue')

# Plot the probability density function (PDF) of the normal distribution
x = np.linspace(-5, 5, 100)
pdf = (1 / (std_dev * np.sqrt(2 * np.pi))) * np.exp(-(x - mean)**2 / (2 * std_dev**2))
plt.plot(x, pdf, color='red', linewidth=2)

# Set plot labels and title
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.title('Normal Distribution')

# Show the plot
plt.show()

## Section 5

### Creating Skewed data

Below is an example of right-skewed data. In this case the mean is pulled to the right, toward the long tail, which makes it less representative of a typical value. In cases like this the median is usually a better representation of the data. The `np.random.gamma()` function allows us to programmatically create skewed data, so don't worry about the details of what that function does.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

gamma_shape = 2  # Shape parameter controlling the skewness
gamma_scale = 2  # Scale parameter controlling the spread
samples = np.random.gamma(gamma_shape, gamma_scale, 1000)

# Plot the histogram of the samples
plt.hist(samples, bins=30, density=True, alpha=0.7, color='blue')

# Set plot labels and title
plt.xlabel('Demo Value')
plt.ylabel('Probability Density')
plt.title('Gamma Distribution')

# Show the plot
plt.show()

A gamma distribution can be described either by a shape and a rate or by a shape and a scale. The example above describes it by shape and scale, with both parameters passed to the `np.random.gamma` function. Gamma distributions are continuous probability distributions, just like the normal distribution, and they tend to be right skewed.
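
To see the effect of the skew on the mean, here is a small sketch that compares the mean and median of gamma samples drawn with the same shape and scale as above. The fixed seed and the use of `np.random.default_rng` are assumptions made only so the run is repeatable.

```python
import numpy as np

# Fixed seed (an assumption for illustration) so the run is repeatable
rng = np.random.default_rng(seed=0)
samples = rng.gamma(shape=2, scale=2, size=1000)

sample_mean = np.mean(samples)
sample_median = np.median(samples)

# In right-skewed data the long tail pulls the mean above the median
print("mean:  ", sample_mean)
print("median:", sample_median)
```

A mean noticeably larger than the median is a quick numeric hint that a variable is right skewed, even before plotting it.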

## Section 6

### Examining the Distribution of Variables

Understanding the distribution of variables helps in assessing their shape, skewness, and potential data transformations. Python offers several visualization libraries to examine variable distributions. Let's use Matplotlib to create a histogram and Seaborn to generate a box plot.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.hist(df_students['reading score'], bins=15)

# Call the labeling functions; assigning to them (plt.xlabel = ...) would overwrite them
plt.xlabel('Scores')
plt.ylabel('Counts')
plt.title('Scores vs Counts for Reading Score')
plt.show()

In [None]:
# Plot a histogram of a variable
plt.hist(df_students['math score'], bins=20)

# We use the .xlabel() function to give the x-axis a label
plt.xlabel('Variable')

# We use the .ylabel() function to give the y-axis a label
plt.ylabel('Frequency')

# We use the .title() function to give the histogram a title
plt.title('Histogram of Variable')

# Display the plot
plt.show()

In [None]:
# Generate a box plot of a variable
sns.boxplot(x=df_students['math score'])
plt.xlabel('Variable')
plt.title('Box Plot of Variable')
plt.show()

In this example, we use `sns.boxplot()` from the seaborn library to create a box plot of the `math score` column in the DataFrame `df_students`. The `x` parameter is set to the column of interest. As before, `plt.xlabel()` and `plt.title()` label the x-axis and give the plot a title. Finally, `plt.show()` displays the box plot.
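
The points a box plot draws beyond its whiskers can also be found numerically. By default, seaborn's whiskers extend at most 1.5 times the interquartile range (IQR) past the quartiles. The sketch below applies that rule to made-up scores (the `scores` Series is an assumption for illustration).

```python
import pandas as pd

# Hypothetical scores with one unusually high value
scores = pd.Series([55, 60, 62, 65, 68, 70, 72, 75, 78, 120])

# Quartiles and interquartile range
q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1

# Values outside these fences would be drawn as individual points on a box plot
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = scores[(scores < lower_fence) | (scores > upper_fence)]

print(lower_fence, upper_fence)
print(flagged)
```

This gives a second way to flag outliers alongside the z-score method, and it does not assume the data are normally distributed.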

## Section 7

### Loading data into Pandas

If you are using repl, copy the contents of the CSV into your repl and give the file the same name, `pokemon_data.csv`.

When you import these packages into your Colab workspace, import them in the top cell and rerun that cell so that the imports are available to the cells below it.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

Histograms divide the data into bins, but choosing the number of bins can be tricky with a large dataset.

In [None]:
# For a categorical column, value_counts() lists each category with its count,
# so the number of rows in the result tells you how many bins you need

df_student_vc = df_students['parental level of education'].value_counts()
print(df_student_vc)
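
As a small sketch of turning that output into a bin count, the snippet below uses made-up education levels (the `demo` DataFrame and its values are assumptions for illustration) and reads the number of categories with `nunique()`.

```python
import pandas as pd

# Hypothetical categorical data for illustration
demo = pd.DataFrame({
    "parental level of education": [
        "high school", "some college", "high school",
        "bachelor's degree", "some college", "high school",
    ]
})

# Count of rows per category, most frequent first
counts = demo["parental level of education"].value_counts()

# Number of distinct categories, usable as the number of bins
n_categories = demo["parental level of education"].nunique()

print(counts)
print(n_categories)
```

Here `n_categories` equals `len(counts)`, so either can be passed as the `bins` argument when plotting a categorical column.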

The code below creates a histogram using the Matplotlib library in Python. The `plt.hist()` function is called with the following arguments:

- x: The data to be plotted. In this case, it is the "parental level of education" column of the DataFrame `df_students`.
- bins: The number of bins to use in the histogram.

In this case, the x argument is set to "parental level of education", which is a categorical variable that represents the education level of a student’s parents. The bins argument is set to 6, which means that the histogram will have 6 bins.

The `plt.xlabel()`, `plt.ylabel()`, and `plt.title()` functions are then used to add labels and a title to the plot.

### Histogram

In [None]:
# Create a histogram
plt.hist(df_students['parental level of education'], bins=6)
plt.xlabel('Education Level')
plt.ylabel('Count')
plt.title('Distribution of Parental Education Level')

plt.show()

## Section 8

The code below creates a bar plot using the Seaborn library in Python. The `sns.barplot()` function is called here with three arguments:

- x: The column name of the categorical variable to be plotted on the x-axis.
- y: The column name of the continuous variable to be plotted on the y-axis.
- data: The name of the DataFrame that contains the data to be plotted.

In this case, the x argument is set to "lunch", which is a categorical variable that represents whether a student receives free/reduced lunch or not. The y argument is set to "math score", which is a continuous variable that represents a student’s math score. Finally, the data argument is set to `df_students`, which is the name of the DataFrame that contains the data.

The resulting plot will show the average math score for students who receive free/reduced lunch and those who do not.

### Bar plots

In [None]:
# Create a bar plot; each bar shows the mean math score for one lunch category
g = sns.barplot(x="lunch", y="math score", data=df_students)
plt.show()
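
The bar heights can be checked numerically with a group-by. This is a minimal sketch on made-up rows (the `demo` DataFrame, its category labels, and its scores are assumptions for illustration).

```python
import pandas as pd

# Hypothetical rows for illustration
demo = pd.DataFrame({
    "lunch": ["standard", "standard", "free/reduced", "free/reduced"],
    "math score": [80, 70, 60, 50],
})

# Mean math score per lunch category: the value each bar would show
group_means = demo.groupby("lunch")["math score"].mean()

print(group_means)
```

Running the same `groupby` on `df_students` should give the heights of the bars in the plot, which is a useful sanity check when interpreting it.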

## Section 9

The code below creates a pie chart using the Matplotlib library in Python. The `value_counts()` method is used to count the number of occurrences of each unique value in the "race/ethnicity" column of the DataFrame `df_students`. The resulting counts are stored in the `category_counts` variable.

The `plt.pie()` function is called here with three arguments:

- x: The values to be plotted. In this case, it is the values of `category_counts`.
- labels: The labels for each value. In this case, it is the index of `category_counts`.
- autopct: A string or function used to format the percentage values.

The resulting pie chart will show the distribution of race/ethnicity in the DataFrame.

### Pie Chart

In [None]:
# Create a pie chart
category_counts = df_students['race/ethnicity'].value_counts()
plt.pie(category_counts.values, labels=category_counts.index, autopct='%1.1f%%')
plt.title('Distribution of Race/Ethnicity')
plt.show()
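
The percentages printed on the slices can be reproduced without plotting. The sketch below uses made-up group labels (the `demo` Series and its values are assumptions for illustration) and shows how the `'%1.1f%%'` format string renders one of them.

```python
import pandas as pd

# Hypothetical categories for illustration
demo = pd.Series(["group A", "group B", "group B", "group C"], name="race/ethnicity")

# Share of each category as a percentage of all rows
percentages = demo.value_counts(normalize=True) * 100

# The autopct format string rounds to one decimal place and appends a percent sign
label = '%1.1f%%' % percentages["group B"]

print(percentages)
print(label)
```

Checking the numbers this way is helpful when a pie chart has many small slices whose labels overlap.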


## Section 10

### Scatter Plots

The `sns.scatterplot()` function is called here with three arguments:

- data: The name of the DataFrame that contains the data to be plotted.
- x: The column name of the variable to be plotted on the x-axis.
- y: The column name of the variable to be plotted on the y-axis.

In this case, the data argument is set to `df_pokemon`, which is the name of the DataFrame that contains the data. The x argument is set to `"HP"`, which is a variable that represents a Pokemon’s hit points. The y argument is set to `"Sp. Atk"`, which is another variable that represents a Pokemon’s special attack points.

In [None]:
# Read csv data
df_pokemon = pd.read_csv('pokemon_data.csv')

sns.scatterplot(data=df_pokemon, x="HP", y="Sp. Atk")
plt.show()

<span style ="background-color:yellow">
TODO - Using any of the graphs mentioned above, spend the next 5 minutes collecting, aggregating, and sorting your data. Then gather the necessary inputs and create the graph that would be most useful for a reader trying to interpret your data. Afterwards, have a 5-minute discussion going over every step of the process with another student.
</span>