# Introduction to **Pandas** with Girls Who Code
Objectives
1. Introduction to Pandas
2. Loading and Viewing Data
3. Data Structures in Pandas
4. Indexing and Selection
5. Data Manipulation
6. Visualization with Pandas


## 1. Introduction to Pandas

### What is Pandas?
* Pandas is a powerful Python library for data manipulation and analysis
* It provides easy-to-use data structures and functions to work with structured data, primarily in the form of tables (dataframes)

### Why use Pandas for Data Analysis?
1. **Ease of Use**: Simplifies common tasks like loading, cleaning, transforming, and analyzing data.
2. **Flexible Data Structures**: Offers Series (1D) and DataFrame (2D) for easy data handling.
3. **Integration with Other Libraries**: Works seamlessly with Matplotlib, Seaborn, and more.
4. **Rich Functionality**: Provides functions for filtering, grouping, merging, reshaping, and statistical analysis.
5. **Community Support**: Large community with plenty of resources for learning and troubleshooting.

>**[Here's](https://images.datacamp.com/image/upload/v1676302204/Marketing/Blog/Pandas_Cheat_Sheet.pdf) a Pandas resource chart that you can refer to during the workshop.**

### Loading a Library
1. Before we can start using Pandas, we need to **install** it using code. However, **Google Colab already comes with Pandas pre-installed**, so there's no need to install it separately.
2. Directly import Pandas into your Python code using the import statement:

  ``` import [library name] as [alias]```
  * set [library name] to ```pandas```, which is the name of the library being imported.
  * set [alias] to ```pd```, which is an abbreviation or shorthand for pandas.


In [None]:
# Import Pandas library here!
import pandas as pd

## 2. Loading and Viewing Data

### Reading Data from Different Sources (CSV)
* "Reading data" refers to the process of importing structured data from external sources into your programming environment.
* The data may come from various sources such as CSV files, Excel spreadsheets, SQL databases, JSON files, or web APIs.

We'll focus on reading data from CSV (Comma-Separated Values) files using the ```pd.read_csv()``` function provided by the Pandas library.

Example: ```dataset = pd.read_csv('file_path.csv')```
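
If you want to try ```pd.read_csv()``` before downloading anything, here is a minimal sketch. It assumes a tiny made-up CSV (the names and scores are hypothetical, not from the workshop file) and feeds it to Pandas from a string instead of a file on disk:

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a real file (made-up data).
csv_text = """name,math score,reading score
Ada,90,85
Grace,78,92
"""

# pd.read_csv() accepts a file path or a file-like object such as StringIO.
example = pd.read_csv(io.StringIO(csv_text))
print(example)
```

The first line of the CSV becomes the column names, and each later line becomes one row.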

Please download the CSV file using [this link.](https://drive.google.com/file/d/1nUKzmXKcxoNqIH52fRvyMYBMv8y2TRmm/view?usp=sharing)

In [None]:
# Create a DataFrame named 'dataset' to store the data from the CSV file using 'pd.read_csv()'.
# 'pd.read_csv()' reads the data from 'StudentsPerformance.csv' and stores it in 'dataset'.

dataset = pd.read_csv('StudentsPerformance.csv')

### Head, Tail, and Sample Functions
After loading the data, you can use the following functions to quickly view the first few rows, last few rows, or a random sample of the DataFrame:

* head()
* tail()
* sample()

Format:
```[data_frame_name].[function]```
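
Here is a small sketch of the three functions on a made-up ten-row table (the ```toy``` DataFrame is hypothetical). With no argument, ```head()``` and ```tail()``` return 5 rows and ```sample()``` returns 1; passing a number changes how many rows you get back:

```python
import pandas as pd

# Hypothetical toy data: ten rows with index 0 through 9.
toy = pd.DataFrame({"score": range(10)})

first_three = toy.head(3)   # rows 0, 1, 2
last_two = toy.tail(2)      # rows 8, 9

# random_state makes the random pick repeatable from run to run.
random_four = toy.sample(4, random_state=0)

print(first_three)
print(last_two)
print(random_four)
```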

In [None]:
# Test out using the head function to see the first 10 rows of the data frame
print(dataset.head(10))

   gender race/ethnicity parental level of education         lunch  \
0  female        group B           bachelor's degree      standard   
1  female        group C                some college      standard   
2  female        group B             master's degree      standard   
3    male        group A          associate's degree  free/reduced   
4    male        group C                some college      standard   
5  female        group B          associate's degree      standard   
6  female        group B                some college      standard   
7    male        group B                some college  free/reduced   
8    male        group D                 high school  free/reduced   
9  female        group B                 high school  free/reduced   

  test preparation course  math score  reading score  writing score  
0                    none          72             72             74  
1               completed          69             90             88  
2                  

In [None]:
# Test out using the tail function to see the last 5 rows of the data frame
print(dataset.tail())

     gender race/ethnicity parental level of education         lunch  \
995  female        group E             master's degree      standard   
996    male        group C                 high school  free/reduced   
997  female        group C                 high school  free/reduced   
998  female        group D                some college      standard   
999  female        group D                some college  free/reduced   

    test preparation course  math score  reading score  writing score  
995               completed          88             99             95  
996                    none          62             55             55  
997               completed          59             71             65  
998               completed          68             78             77  
999                    none          77             86             86  


In [None]:
# Test out the sample function to see one randomly chosen row of the data frame
print(dataset.sample())

    gender race/ethnicity parental level of education         lunch  \
301   male        group D            some high school  free/reduced   

    test preparation course  math score  reading score  writing score  
301                    none          56             54             52  


### Basic Data Exploration Techniques
Once the data is loaded, you can perform basic exploration to understand its structure and characteristics. Some common exploration techniques include:

* info(): Provides a concise summary of the DataFrame, including its data types, non-null values, and memory usage.
* describe(): Generates descriptive statistics of numerical columns in the DataFrame, such as count, mean, standard deviation, min, max, etc.
* shape: Returns the dimensions of the DataFrame (number of rows and columns).
* columns: Returns the column names of the DataFrame.
* dtypes: Returns the data types of each column in the DataFrame.
* ```iloc```: Extract specific rows using a given index or a range of indices.

In [None]:
# Try out some!
# Be sure to use print()

# Getting a concise summary of the DataFrame
print("Summary:")
print(dataset.info())

# Generating descriptive statistics of numerical columns
print("\nDescriptive Statistics:")
print(dataset.describe())

# Getting the dimensions of the DataFrame
print("\nDimensions:")
print(dataset.shape)

# Getting the column names of the DataFrame
print("\nColumn Names:")
print(dataset.columns)

# Getting the data types of each column
print("\nData Types:")
print(dataset.dtypes)

# Extract rows 892 to 899 (the stop index, 900, is not included)
print("\nRows 892 to 899:")
print(dataset.iloc[892:900])

Summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
None

Descriptive Statistics:
       math score  reading score  writing score
count  1000.00000    1000.000000    1000.000000
mean     66.08900      69.169000      68.054000
std      15.16308      14.600192      15.195657
min       0.00000      17.000000      10.00

## 3. Data Structures in Pandas




During data collection, we usually use spreadsheets to tabulate our data. However, these formats need to be converted into some sort of data structure that our computer can read and manipulate.

In Pandas, we use DataFrames to store our data. A DataFrame is essentially a table, and it can hold the data we load from a CSV file. DataFrames are extremely useful for creating visualizations, building ML models, and other data science applications.
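
To make the two structures concrete, here is a minimal sketch that builds a Series (1D) and a DataFrame (2D) by hand. The names and ages are made up for illustration:

```python
import pandas as pd

# A Series is a single labeled column of values (1D).
ages = pd.Series([15, 16, 17], name="age")

# A DataFrame is a table made of columns (2D).
# Each dictionary key becomes a column name.
students = pd.DataFrame({
    "name": ["Ada", "Grace", "Katherine"],
    "age": [15, 16, 17],
})

print(ages)
print(students)

# ndim reports the number of dimensions: 1 for a Series, 2 for a DataFrame.
print(ages.ndim, students.ndim)
```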

### type()
type() is a really helpful function that returns the datatype of whatever you put in as a parameter. Let's test it out below!


In [None]:
print(type(dataset["reading score"]))

<class 'pandas.core.series.Series'>


### Extracting DataFrames and Series


In [None]:
math_scores = dataset["math score"]

# we can check that we extracted our data correctly by displaying it
math_scores

0      72
1      69
2      90
3      47
4      76
       ..
995    88
996    62
997    59
998    68
999    77
Name: math score, Length: 1000, dtype: int64

Try extracting all of the scores from ```dataset``` and display the first few rows using .head(). Then double check your extracted data is a DataFrame using the type() function.
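
As a hint, the number of brackets matters. This sketch uses a made-up two-row table (hypothetical data) to show that single brackets give back a Series, while double brackets give back a DataFrame:

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    "math score": [72, 69],
    "reading score": [72, 90],
})

# Single brackets with one column name -> Series
one_column = toy["math score"]

# Double brackets with a list of column names -> DataFrame
two_columns = toy[["math score", "reading score"]]

print(type(one_column))
print(type(two_columns))
```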


In [None]:
# extract and display the head of all of the scores ( try this one by yourself :) )
all_scores = dataset[["math score", "reading score", "writing score"]]
all_scores

Unnamed: 0,math score,reading score,writing score
0,72,72,74
1,69,90,88
2,90,95,93
3,47,57,44
4,76,78,75
...,...,...,...
995,88,99,95
996,62,55,55
997,59,71,65
998,68,78,77


## 4. Indexing

We can use indexing to get certain rows of the data. This is called "slicing" the dataset.

The left side of the printed dataframe shows row numbers starting from 0. These are the row indexes. It's possible to change them to be other values, like words/strings, but they generally start from 0.

To specify a particular row, just give its index value. If you want a particular range of rows, use this notation:
```dataset.iloc[starting number : one more than your stopping number]```



---


Two useful tools for indexing and slicing data are iloc and loc (loc as in "location"). iloc selects rows by integer position, using a single position or a range of positions; the stopping position is not included. loc selects rows and columns by label, using index labels and column names; when you give loc a range of labels, the stopping label is included.

Both can return either a Series or a DataFrame: selecting a single row gives a Series, and selecting a range of rows gives a DataFrame. Let's practice using them in the cell below:
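
The difference between the two is easiest to see when the row labels do not start at 0. This is a minimal sketch on a made-up table (the ```toy``` DataFrame and its labels are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with row labels 10, 11, 12, 13, 14.
toy = pd.DataFrame(
    {"letter": ["a", "b", "c", "d", "e"]},
    index=[10, 11, 12, 13, 14],
)

# iloc counts positions from 0 and leaves out the stop position.
by_position = toy.iloc[1:3]   # positions 1 and 2 -> labels 11 and 12

# loc matches index labels and includes the stop label.
by_label = toy.loc[11:13]     # labels 11, 12, and 13

print(by_position)
print(by_label)
```

In ```dataset``` the labels happen to equal the positions, so the only visible difference is that a loc range returns one extra row at the end.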

In [None]:
# iloc
# extracting one row
d1 = dataset.iloc[15]
d1

gender                                   female
race/ethnicity                          group C
parental level of education    some high school
lunch                                  standard
test preparation course                    none
math score                                   69
reading score                                75
writing score                                78
Name: 15, dtype: object

In [None]:
# extracting multiple rows
d2 = dataset.iloc[100:110]
d2

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
100,male,group B,some college,standard,none,79,67,67
101,male,group D,bachelor's degree,standard,completed,68,74,74
102,female,group D,associate's degree,standard,none,85,91,89
103,male,group B,high school,standard,completed,60,44,47
104,male,group C,some college,standard,completed,98,86,90
105,female,group C,some college,standard,none,58,67,72
106,female,group D,master's degree,standard,none,87,100,100
107,male,group E,associate's degree,standard,completed,66,63,64
108,female,group B,associate's degree,free/reduced,none,52,76,70
109,female,group B,some high school,standard,none,70,64,72


In [None]:
# loc
# extracting columns and rows
d3 = dataset.loc[890:900, ["lunch", "reading score", "writing score"]]
d3

Unnamed: 0,lunch,reading score,writing score
890,standard,85,91
891,standard,92,85
892,free/reduced,67,73
893,standard,74,75
894,standard,62,69
895,free/reduced,34,38
896,free/reduced,29,27
897,free/reduced,78,79
898,standard,54,63
899,standard,78,82


Try practicing iloc and loc in the cells below by yourself:

In [None]:
# let's use iloc to get the 400th row of dataset (position 399, because counting starts at 0)
d4 = dataset.iloc[399]

In [None]:
# now let's try to get a few consecutive rows (e.g. rows 500 to 599. remember that with iloc your stopping index is NOT inclusive!)
d5 = dataset.iloc[500:600]
d5

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
500,female,group D,master's degree,standard,none,74,79,82
501,female,group B,associate's degree,standard,completed,94,87,92
502,male,group C,some college,free/reduced,none,63,61,54
503,female,group E,associate's degree,standard,completed,95,89,92
504,female,group D,master's degree,free/reduced,none,40,59,54
...,...,...,...,...,...,...,...,...
595,female,group C,bachelor's degree,standard,completed,56,79,72
596,male,group B,high school,free/reduced,none,30,24,15
597,male,group A,some high school,standard,none,53,54,48
598,female,group D,high school,standard,none,69,77,73


In [None]:
# you're doing great bestie keep it up
# ok we're getting adventurous now, let's print out the ethnicity and gender of each person in rows 300 to 400 (remember that with loc your stopping index IS inclusive!)
d6 = dataset.loc[300:400, ["race/ethnicity", "gender"]]
d6

Unnamed: 0,race/ethnicity,gender
300,group A,male
301,group D,male
302,group C,female
303,group B,male
304,group C,female
...,...,...
396,group B,female
397,group C,female
398,group B,male
399,group D,male


In [None]:
# yay! if you want to test more things out with loc and iloc below, go right ahead!


#### Resetting Indices

Notice how the index values on the left side of d3 still start at 890 instead of 0. That is silly, so let's fix it!

In [None]:
# use reset_index to reset the indices of d3, and display the result
d3.reset_index(drop=True,inplace=True)
d3

Unnamed: 0,lunch,reading score,writing score
0,standard,85,91
1,standard,92,85
2,free/reduced,67,73
3,standard,74,75
4,standard,62,69
5,free/reduced,34,38
6,free/reduced,29,27
7,free/reduced,78,79
8,standard,54,63
9,standard,78,82


Mmm yes much better.
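
If you are curious what ```drop=True``` does, here is a small sketch on made-up data (the ```toy``` table is hypothetical). Without ```inplace=True```, ```reset_index``` returns a new DataFrame and leaves the original alone:

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({"score": [70, 80, 90, 100]})
subset = toy.iloc[2:4]   # keeps the original labels 2 and 3

# drop=True throws away the old labels and renumbers from 0.
renumbered = subset.reset_index(drop=True)

# Without drop=True, the old labels are saved in a new column named "index".
kept = subset.reset_index()

print(renumbered)
print(kept)
```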

## 5. Data Manipulation

### Basic Operations
Basic arithmetic operations can be used in conjunction with conditional selection. With these operations, you can find rows where:
* ```+``` the sum of two columns meets certain criteria.
* ``` - ``` the difference between two columns meets certain criteria.
* ``` * ``` the product of two columns meets certain criteria.
* ``` / ``` the quotient of two columns meets certain criteria.
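
Here is a minimal sketch of the idea using subtraction, on a made-up three-row table (hypothetical data). It shows the three steps separately: do the arithmetic, compare the result to a number, then keep the rows where the comparison is True:

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    "math score": [60, 85, 70],
    "reading score": [75, 80, 95],
})

# Step 1: subtracting two columns gives a new Series of differences.
gap = toy["reading score"] - toy["math score"]   # 15, -5, 25

# Step 2: comparing that Series to a number gives True/False for each row.
mask = gap > 10

# Step 3: putting the True/False Series in brackets keeps only the True rows.
big_gap = toy[mask]

print(big_gap)
```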


In [None]:
# Find rows where the average of column 'math score', column 'reading score', and column 'writing score' is greater than 90.

# Calculate the average of 'math score', 'reading score', and 'writing score'
dataset['average_score'] = (dataset['math score'] + dataset['reading score'] + dataset['writing score']) / 3

# Filter rows where the average score is greater than 90
filtered_dataset = dataset[dataset['average_score'] > 90]

# Display the filtered DataFrame
filtered_dataset

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,average_score
2,female,group B,master's degree,standard,none,90,95,93,92.666667
6,female,group B,some college,standard,completed,88,95,92,91.666667
104,male,group C,some college,standard,completed,98,86,90,91.333333
106,female,group D,master's degree,standard,none,87,100,100,95.666667
114,female,group E,bachelor's degree,standard,completed,99,100,100,99.666667
121,male,group B,associate's degree,standard,completed,91,89,92,90.666667
122,female,group C,some college,standard,completed,88,93,93,91.333333
149,male,group E,associate's degree,free/reduced,completed,100,100,93,97.666667
165,female,group C,bachelor's degree,standard,completed,96,100,100,98.666667
179,female,group D,some high school,standard,completed,97,100,100,99.0


### Applying Functions to Data
Pandas lets you apply summary functions to a Series or along an axis of a DataFrame. Three common ones are:
* ```.mean()```
* ```.median()```
* ```.mode()```
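
Here is a quick sketch on a made-up list of scores (hypothetical data). One thing to notice is that ```.mode()``` returns a Series rather than a single number, because more than one value can tie for most common:

```python
import pandas as pd

# Hypothetical scores for illustration.
scores = pd.Series([70, 80, 80, 90, 100])

print(scores.mean())     # the average
print(scores.median())   # the middle value
print(scores.mode())     # the most common value(s), as a Series

# When two values tie, mode() returns both of them.
tied = pd.Series([1, 1, 2, 2])
print(tied.mode())
```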

In [None]:
# Print the mean writing score
print(dataset['writing score'].mean())
# Print the median writing score
print(dataset['writing score'].median())

# Print the mode writing score
print(dataset['writing score'].mode())


68.054
69.0
0    74
Name: writing score, dtype: int64


## 6. Visualization with Pandas

### Box Plot
Concept: Box plots display the distribution of data based on five summary statistics: minimum, first quartile, median, third quartile, and maximum.
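
You can also compute the five summary statistics yourself. This is a minimal sketch on made-up scores (hypothetical data), using ```.quantile()``` for the quartiles:

```python
import pandas as pd

# Hypothetical scores for illustration.
scores = pd.Series([50, 60, 70, 80, 90])

minimum = scores.min()
q1 = scores.quantile(0.25)   # first quartile
median = scores.median()
q3 = scores.quantile(0.75)   # third quartile
maximum = scores.max()

print(minimum, q1, median, q3, maximum)
```

The box in a box plot runs from the first quartile to the third quartile, with a line at the median.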

Key words:
* ```plt.title```: This function sets the title of the plot.
* ```plt.show()```: This function displays the plot you've created.

In [None]:
import matplotlib.pyplot as plt
dataset.boxplot(column='reading score')
plt.title('Box Plot of Reading Score')
plt.show()

### Histogram
* Displays the distribution of writing scores in the dataset.
* Represents the frequency of writing scores within specified score intervals (bins).

Key words:
* ```kind```: This parameter specifies the type of plot you want to create. In this case, kind='hist' indicates that you want to create a histogram.
* ```bins```: This parameter specifies the number of bins (or intervals) into which the data will be divided in the histogram. In the code below, bins=10 divides the data into 10 equal-width bins.
* ```plt.title```: This function sets the title of the plot.
* ```plt.xlabel```: This function sets the label for the x-axis of the plot.
* ```plt.ylabel```: This function sets the label for the y-axis of the plot.
* ```plt.show()```: This function displays the plot you've created.
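
To see what "bins" means without drawing anything, here is a small sketch that counts how many values land in each bin. It uses NumPy's ```np.histogram``` as a stand-in for the counting step (an assumption made for illustration; it is not part of the workshop code), on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration.
values = pd.Series([1, 2, 2, 3, 3, 3, 9, 10])

# Split the range from 1 to 10 into 3 equal-width bins and count each one.
counts, edges = np.histogram(values, bins=3)

print(edges)    # where each bin starts and ends
print(counts)   # how many values fall in each bin
```

Each bar in a histogram is as tall as the count for its bin.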

In [None]:
import matplotlib.pyplot as plt
# Fill in a column into XXXX
dataset['XXXX'].plot(kind='hist', bins=10, edgecolor='black')
plt.title('XXXX')
plt.xlabel('XXXX')
plt.ylabel('XXXX')
plt.show()

### Pie Chart

**Concept**:
* Represents the distribution of a categorical column in the dataset.
* Each slice of the pie represents one category of students (here, one gender).
* The size of each slice corresponds to the percentage of students with that gender.

Key words:
* ```kind```: This parameter specifies the type of plot you want to create. In this case, kind='pie' indicates that you want to create a pie chart.
* ```autopct```: This parameter controls the formatting of the percentages displayed on each wedge of the pie chart. '%1.1f%%' formats the percentage with one decimal place.
* ```startangle```: This parameter rotates the start of the pie chart counterclockwise from the x-axis by the given number of degrees. In this case, startangle=90 makes the first wedge begin at the top of the circle.
* ```plt.title```: This function sets the title of the plot.
* ```plt.axis```: This function adjusts the aspect ratio of the plot. Setting it to 'equal' ensures that the pie chart is drawn as a circle.
* ```plt.show()```: This function displays the plot you've created.
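
Before plotting, it helps to see what ```value_counts()``` gives you. This is a minimal sketch on a made-up column (hypothetical data, not the workshop file):

```python
import pandas as pd

# Hypothetical categorical column for illustration.
toy = pd.Series(["female", "male", "female", "female"], name="gender")

# Count how many times each category appears.
counts = toy.value_counts()

# normalize=True gives each category's share of the total instead of a count.
shares = toy.value_counts(normalize=True)

print(counts)
print(shares)
```

The pie chart draws one slice per category, sized by these counts.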

In [None]:
import matplotlib.pyplot as plt

# Create a gender dataset that counts 'male' and 'female' values
XXXX = dataset['XXXX'].value_counts()

# Plotting a pie chart
XXXX.plot(kind='pie', autopct='%1.1f%%', startangle=90)
plt.title('XXXX')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

### Scatterplot

**Concept**: Shows relationship between two numerical columns.
Each dot represents a student.

Regression Line: Best-fitting straight line indicating overall trend.
* Slope shows relationship direction:
  * Upwards: Positive relationship
  * Downwards: Negative relationship

Key words:
* ```sns.regplot()```: This function creates a scatter plot with a linear regression line fit to the data. It's used for visualizing the relationship between two variables.
* ```plt.title()```: This function sets the title of the plot.
* ```plt.xlabel()```: This function sets the label for the x-axis of the plot.
* ```plt.ylabel()```: This function sets the label for the y-axis of the plot.
* ```plt.show()```: This function displays the plot you've created.
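
To check the direction of a relationship as a number, you can fit a straight line and look at the sign of its slope. This sketch uses NumPy's ```np.polyfit``` as a stand-in for the line fitting (an assumption made for illustration; it is not how the workshop code draws the plot), on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "goes_up": [2, 4, 6, 8],
    "goes_down": [8, 6, 4, 2],
})

# Fit a degree-1 polynomial (a straight line): y = slope * x + intercept
up_slope, up_intercept = np.polyfit(toy["x"], toy["goes_up"], 1)
down_slope, down_intercept = np.polyfit(toy["x"], toy["goes_down"], 1)

print(up_slope)     # positive slope -> positive relationship
print(down_slope)   # negative slope -> negative relationship
```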









In [None]:
# import seaborn as sns
import XXXX as XXXX

# Select the relevant columns for correlation analysis from dataset
correlation_dataset = dataset[['XXXX', 'XXXX']]

# use sns.regplot() to create the regression plot
# x specifies the column to plot on the x-axis
# y specifies the column to plot on the y-axis
sns.regplot(x='XXXX', y='XXXX', data=XXXX)

# set the title using plt.title()
plt.title(XXXX)

# set x-label using plt.xlabel
plt.xlabel(XXXX)

# set y-label using plt.ylabel
plt.ylabel(XXXX)

# show the plot
plt.show()