# Dataframes and Data Visualization Part 2

## Basic summary statistics

Basic summary statistics provide an overview of the central tendency, dispersion, and shape of a variable.
Python's pandas library offers a convenient method called describe() to generate summary statistics for a DataFrame.

In this example, we import the pandas library and load the data from a CSV file into a DataFrame named df_students. The describe()
method is then called on the DataFrame to calculate summary statistics (count, mean, standard deviation, minimum, quartiles,
and maximum) for each numerical column in the data. The result is stored in the summary variable and printed.

In [None]:
import pandas as pd

# Load the data into a DataFrame
df_students = pd.read_csv('students_performance.csv')

# Generate summary statistics
summary = df_students.describe()
print(summary)
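
As a minimal sketch of how to read this output, the example below builds a small made-up DataFrame (the name `toy` and all values are hypothetical, not taken from students_performance.csv) and looks up individual statistics in the result of describe().

```python
import pandas as pd

# Hypothetical scores, for illustration only
toy = pd.DataFrame({
    "math score": [50, 60, 70, 80, 90],
    "reading score": [55, 65, 75, 85, 95],
})

toy_summary = toy.describe()

# describe() returns a DataFrame indexed by statistic name,
# so individual values can be looked up with .loc
math_mean = toy_summary.loc["mean", "math score"]
math_median = toy_summary.loc["50%", "math score"]  # the 50% quartile is the median
math_std = toy_summary.loc["std", "math score"]

print(toy_summary)
print("mean:", math_mean, "median:", math_median, "std:", math_std)
```

When the mean and the median are close, as they are here, the distribution is likely to be roughly symmetric. A mean well above the median hints at a tail of large values on the right.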

<span style ="background-color:yellow">
TODO: Print out the statistical summary for your dataset and explain to your classmates what the values represent. Based on this data, what do you expect the shape of your data to be?
</span>

## Counting Missing Values and Identifying Outliers

Missing values and outliers are important considerations in data analysis. Python's pandas library provides functions to count missing values and identify outliers.

To count missing values:

In [None]:
# Count missing values in each column
missing_values = df_students.isnull().sum()
print(missing_values)

The isnull() function checks each element in the DataFrame and returns True if it is missing (NaN) and False otherwise. By calling sum() on the result, we obtain the count of missing values for each column.
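
The same idea can be tried on a tiny example. The sketch below uses a made-up DataFrame (the name `toy_missing` and its values are hypothetical) and also shows two common follow-ups: the total number of missing values and the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries, for illustration only
toy_missing = pd.DataFrame({
    "math score": [72.0, np.nan, 90.0, 47.0],
    "lunch": ["standard", "free/reduced", None, "standard"],
})

# Missing values per column
missing_per_column = toy_missing.isnull().sum()

# Total number of missing values in the whole DataFrame
total_missing = int(missing_per_column.sum())

# Share of missing values per column (the mean of True/False values)
missing_share = toy_missing.isnull().mean()

print(missing_per_column)
print("total:", total_missing)
print(missing_share)
```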

Sometimes, data may have outliers. An outlier is a data point that differs significantly from the other data points. Its existence may skew our data and distort other descriptive statistics to the point where the conclusions we draw from them are not accurate. We can identify outliers using the z-score method. A z-score is a statistical measurement that describes how many standard deviations a data point lies from the mean of the data. We can use a z-score as a cutoff point for reasonable data. In this example, we will use 3 as the cutoff.

To identify outliers using a z-score method:

In [None]:
from scipy import stats

# Calculate z-scores for each value in a column
z_scores = stats.zscore(df_students['math score'])

# Identify outliers using a z-score threshold
threshold = 3
outliers = df_students[abs(z_scores) > threshold]
print(outliers)

In this example, we use the zscore() function from the scipy.stats module to calculate the z-score of each value in the 'math score' column. We then define a threshold (here, 3) to identify outliers. We use the abs() function to take the absolute values of the z-scores, so that values far above and far below the mean are both caught. Any value whose absolute z-score exceeds the threshold is considered an outlier. Finally, we create a DataFrame named outliers containing the rows with identified outliers and print it.
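
To make the calculation concrete, here is a minimal sketch that computes z-scores by hand with pandas instead of scipy. The Series `scores` and its values are made up for illustration. It uses `ddof=0` (the population standard deviation), which is the default of scipy.stats.zscore.

```python
import pandas as pd

# Hypothetical scores, for illustration only:
# twenty typical values and one extreme value
scores = pd.Series([70] * 10 + [80] * 10 + [0])

# z-score = (value - mean) / standard deviation
# ddof=0 gives the population standard deviation
manual_z = (scores - scores.mean()) / scores.std(ddof=0)

# Keep only the values whose absolute z-score exceeds the threshold
threshold = 3
manual_outliers = scores[manual_z.abs() > threshold]
print(manual_outliers)
```

Only the value 0 is flagged here. Note that in a dataset of ten or fewer values no z-score computed this way can exceed 3 in absolute value, so this cutoff is only meaningful for larger datasets.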

## Examining the Distribution of Variables

Understanding the distribution of variables helps in assessing their shape, skewness, and potential data transformations. Python offers several visualization libraries to examine variable distributions. Let's use matplotlib to create a histogram and seaborn to generate a box plot.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot a histogram of the math scores
plt.hist(df_students['math score'], bins=20)

# Label the x-axis
plt.xlabel('Math Score')

# Label the y-axis
plt.ylabel('Frequency')

# Give the histogram a title
plt.title('Histogram of Math Scores')

# Display the plot
plt.show()

In [None]:
# Generate a box plot of the math scores
sns.boxplot(x=df_students['math score'])
plt.xlabel('Math Score')
plt.title('Box Plot of Math Scores')
plt.show()

In this example, we use sns.boxplot() from the seaborn library to create a box plot of the 'math score' column in the DataFrame df_students. The x parameter is set to the column of interest. As with the histogram, xlabel() and title() label the x-axis and give the plot a title. Finally, plt.show() displays the box plot.
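
A box plot also marks outliers as individual points. By default, matplotlib and seaborn draw the whiskers out to the most extreme values that lie within 1.5 times the interquartile range (IQR) of the quartiles, and anything beyond is drawn as a point. The sketch below reproduces that rule by hand on a made-up Series (`values` is hypothetical) so you can see which points a box plot would flag.

```python
import pandas as pd

# Hypothetical scores, for illustration only
values = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 26, 100])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points outside these fences are drawn individually on a default box plot
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

flagged = values[(values < lower_fence) | (values > upper_fence)]
print("fences:", lower_fence, upper_fence)
print(flagged)
```

This rule and the z-score method can disagree, so it is worth checking both before deciding that a value is an outlier.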

<span style ="background-color:yellow">
TODO: Examine your data and find out if there are any outliers or null values. Next, create a box plot and a histogram for your dataset. Once completed, talk with another student about how any outliers or null values may have affected your graphs. Then discuss any observations you made about your graphs and anything that stood out to you.
</span>

## Loading data into Pandas

If you are using a repl, copy the contents of the CSV file into a new file in your repl and name it `pokemon_data.csv`.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

### Histograms, Bar Plots, and Pie Charts for Categorical Variables

Histograms divide the data into bins, but choosing the number of bins can be tricky with a large dataset. For a categorical variable, a natural choice is one bin per category.


In [None]:
# To find the number of bins, count the distinct categories with value_counts()

df_student_vc = df_students['parental level of education'].value_counts()
print(df_student_vc)
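
If you only need the number of categories rather than the full table of counts, pandas also provides nunique(). A minimal sketch on a made-up column (the Series `education` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical categorical column, for illustration only
education = pd.Series([
    "high school", "some college", "high school",
    "bachelor's degree", "some college", "high school",
])

# Number of occurrences of each category, most frequent first
counts = education.value_counts()

# Number of distinct categories, usable as the number of bins
n_bins = education.nunique()

print(counts)
print("bins:", n_bins)
```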

The next cell creates a histogram using the Matplotlib library in Python. The plt.hist() function is used to create the plot with the following arguments:

- x: The data to be plotted. In this case, it is the "parental level of education" column of the DataFrame df_students.
- bins: The number of bins to use in the histogram.

In this case, the x argument is set to "parental level of education", which is a categorical variable that represents the education level of a student’s parents. The bins argument is set to 6, which means that the histogram will have 6 bins.

The plt.xlabel(), plt.ylabel(), and plt.title() functions are then used to add labels and a title to the plot.

In [None]:
# Create a histogram
plt.hist(df_students['parental level of education'], bins=6)
plt.xlabel('Education Level')
plt.ylabel('Count')
plt.title('Distribution of Parental Education Level')

plt.show()

The next cell creates a bar plot using the Seaborn library in Python. The sns.barplot() function is called with three arguments:

- x: The column name of the categorical variable to be plotted on the x-axis.
- y: The column name of the continuous variable to be plotted on the y-axis.
- data: The name of the DataFrame that contains the data to be plotted.

In this case, the x argument is set to "lunch", which is a categorical variable that represents whether a student receives free/reduced lunch or not. The y argument is set to "math score", which is a continuous variable that represents a student’s math score. Finally, the data argument is set to df_students, which is the name of the DataFrame that contains the data.

The resulting plot will show the average math score for students who receive free/reduced lunch and those who do not.

In [None]:
# Create a bar plot of the average math score for each lunch category
sns.barplot(x="lunch", y="math score", data=df_students)
plt.show()
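
By default, sns.barplot() aggregates the y values with the mean, so the bar heights can be checked numerically with a groupby. The sketch below assumes a small made-up DataFrame (`toy_lunch` and its values are hypothetical) with the same column names as above.

```python
import pandas as pd

# Hypothetical data, for illustration only
toy_lunch = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard", "free/reduced"],
    "math score": [80, 60, 70, 50],
})

# Average math score per lunch category
mean_by_lunch = toy_lunch.groupby("lunch")["math score"].mean()
print(mean_by_lunch)
```

Running the same groupby on df_students should give the heights of the two bars in the plot.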

The next cell creates a pie chart using the Matplotlib library in Python. The value_counts() function is used to count the number of occurrences of each unique value in the "race/ethnicity" column of the DataFrame df_students. The resulting counts are stored in the category_counts variable.

The plt.pie() function takes three arguments:

- x: The values to be plotted. In this case, it is the values of category_counts.
- labels: The labels for each value. In this case, it is the index of category_counts.
- autopct: A string or function used to format the percentage values.

The resulting pie chart will show the distribution of race/ethnicity in the DataFrame.

In [None]:
# Create a pie chart
category_counts = df_students['race/ethnicity'].value_counts()
plt.pie(category_counts.values, labels=category_counts.index, autopct='%1.1f%%')
plt.title('Distribution of Race/Ethnicity')
plt.show()
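
The format string '%1.1f%%' tells plt.pie() to print each slice's percentage with one decimal place. The same percentages can be computed directly with value_counts(normalize=True), as in this minimal sketch on a made-up column (the Series `groups` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical categories, for illustration only
groups = pd.Series(
    ["group A"] + ["group B"] * 2 + ["group C"] * 5
)

# normalize=True returns proportions instead of counts
percentages = groups.value_counts(normalize=True) * 100

# Format each percentage the same way autopct='%1.1f%%' does
for name, pct in percentages.items():
    print(name, "%1.1f%%" % pct)
```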

### Scatter Plots

The sns.scatterplot() function takes three arguments:

- data: The name of the DataFrame that contains the data to be plotted.
- x: The column name of the variable to be plotted on the x-axis.
- y: The column name of the variable to be plotted on the y-axis.

In this case, the data argument is set to df_pokemon, which is the name of the DataFrame that contains the data. The x argument is set to "HP", which is a variable that represents a Pokemon’s hit points. The y argument is set to "Sp. Atk", which is another variable that represents a Pokemon’s special attack points.

In [None]:
# Read the CSV data
df_pokemon = pd.read_csv('pokemon_data.csv')

# Plot hit points against special attack points
sns.scatterplot(data=df_pokemon, x="HP", y="Sp. Atk")
plt.show()

<span style ="background-color:yellow">
TODO: Using any of the graphs mentioned above, spend the next 5 minutes collecting, aggregating, and sorting your data. Then gather the necessary inputs and create the graph that would be most useful to a reader trying to interpret your data. Afterwards, have a 5-minute discussion with another student going over every step of the process.
</span>