#  Unit 2.3 Extracting Information from Data
> Data connections, trends, and correlation.  Pandas is introduced because it could be valuable for PBL and data validation, as well as for understanding College Board topics.
- toc: true
- image: /images/python.png
- categories: []
- type: ap
- week: 25

# Files To Get

Save this file to your **_notebooks** folder

wget https://raw.githubusercontent.com/nighthawkcoders/APCSP/master/_notebooks/2023-03-06-AP-unit2_3.ipynb

Save these files into a subfolder named **files** in your **_notebooks** folder

wget https://raw.githubusercontent.com/nighthawkcoders/APCSP/master/_notebooks/files/data.csv

wget https://raw.githubusercontent.com/nighthawkcoders/APCSP/master/_notebooks/files/grade.json

Save this image into a subfolder named **images** in your **_notebooks** folder

wget https://raw.githubusercontent.com/nighthawkcoders/APCSP/master/_notebooks/images/table_dataframe.png


# Pandas and DataFrames
> In this lesson we will be exploring data using Pandas.  [From Pandas Overview](https://pandas.pydata.org/docs/getting_started/index.html) -- When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.


![DataFrame](images/table_dataframe.png)

In [None]:
'''Pandas is used to gather data sets through its DataFrames implementation'''
import pandas as pd

# Cleaning Data

When looking at a data set, check to see what data needs to be cleaned. Examples include:
- Missing Data Points
- Invalid Data
- Inaccurate Data

Run the following code to see what needs to be cleaned

In [None]:
# reads the JSON file and converts it to a Pandas DataFrame
df = pd.read_json('files/grade.json')

print(df)
# What part of the data set needs to be cleaned?
# From PBL learning, what is a good time to clean data?  Hint, remember Garbage in, Garbage out?
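
The three kinds of problems listed above can often be handled with a few Pandas calls. The following is only a minimal sketch: the records are made up for illustration (they are not the contents of `grade.json`), and the 0 to 4 GPA range is an assumption about the grading scale.

```python
import pandas as pd

# Hypothetical records, invented for this sketch (not from grade.json)
raw = pd.DataFrame({
    "Student ID": [101, 102, 103, 104, 105],
    "GPA": ["3.5", "n/a", "2.8", "40", None],
})

cleaned = raw.copy()

# Invalid data: text that is not a number becomes NaN instead of raising an error
cleaned["GPA"] = pd.to_numeric(cleaned["GPA"], errors="coerce")

# Missing data: drop rows that have no usable GPA
cleaned = cleaned.dropna(subset=["GPA"])

# Inaccurate data: keep only values inside the assumed 0 to 4 scale
cleaned = cleaned[cleaned["GPA"].between(0, 4)]

print(cleaned)
```

Dropping rows is the simplest choice, but it throws information away. Depending on the project, correcting the value at its source may be the better fix.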

# Extracting Info

Take a look at some features of the Pandas library that extract info from the dataset.

## DataFrame Extract Column

In [None]:
#print the values in the GPA column with column header
print(df[['GPA']])

print()

#try two columns and remove the index from print statement
print(df[['Student ID','GPA']].to_string(index=False))

## DataFrame Sort

In [None]:
#sort values
print(df.sort_values(by=['GPA']))

print()

#sort the values in reverse order
print(df.sort_values(by=['GPA'], ascending=False))

## DataFrame Selection or Filter

In [None]:
#print only the rows that meet a specific criterion
print(df[df.GPA > 3.00])
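
A filter can also combine more than one condition. The sketch below uses a small made-up table (the `students` name and its values are hypothetical) and the bracket form `students["GPA"]`, which selects the same column as the attribute form `students.GPA`.

```python
import pandas as pd

# Hypothetical data, invented for this sketch
students = pd.DataFrame({
    "Student ID": [1, 2, 3, 4],
    "GPA": [3.9, 2.5, 3.2, 1.8],
})

# Both conditions must hold: wrap each one in parentheses and join with &
mid_range = students[(students["GPA"] > 3.0) & (students["GPA"] < 3.5)]

# Either condition may hold: join with |
extremes = students[(students["GPA"] >= 3.9) | (students["GPA"] < 2.0)]

print(mid_range)
print()
print(extremes)
```

The parentheses matter because `&` and `|` bind more tightly than the comparison operators in Python.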

## DataFrame Selection Max and Min

In [None]:
print(df[df.GPA == df.GPA.max()])
print()
print(df[df.GPA == df.GPA.min()])

# Create your own DataFrame

Using Pandas allows you to create your own DataFrame in Python.

## Python Dictionary to Pandas DataFrame

In [None]:
import pandas as pd

#the data can be stored as a python dictionary
#(named data rather than dict, so the built-in dict type is not overwritten)
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
#stores the data in a data frame
print("-------------Dict_to_DF------------------")
df = pd.DataFrame(data)
print(df)

print("----------Dict_to_DF_labels--------------")

#or with the index argument, you can label rows.
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)

## Examine DataFrame Rows

In [None]:
print("-------Examine Selected Rows---------")
#use a list for multiple labels:
print(df.loc[["day1", "day3"]])

#refer to the row index:
print("--------Examine Single Row-----------")
print(df.loc["day1"])
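
Rows can be picked by position as well as by label. This sketch rebuilds the same calories and duration table under the hypothetical name `workouts` and compares `loc` (label based) with `iloc` (position based).

```python
import pandas as pd

# Same values as the dictionary example; the name workouts is only for this sketch
workouts = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

# loc selects by row label, iloc selects by integer position
first_by_label = workouts.loc["day1"]
first_by_position = workouts.iloc[0]

# loc also accepts a row label and a column name together
single_value = workouts.loc["day2", "calories"]

# iloc supports slicing, here the last two rows
last_two = workouts.iloc[-2:]

print(first_by_label)
print(single_value)
print(last_two)
```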

## Pandas DataFrame Information

In [None]:
#print info about the data set
#info() prints its report itself and returns None, so it is not wrapped in print()
df.info()

# Example of larger data set

Pandas can read CSV and many other types of files. Run the following code to see more features with a larger data set.

In [None]:
import pandas as pd

#read csv and sort 'Duration' largest to smallest
df = pd.read_csv('files/data.csv').sort_values(by=['Duration'], ascending=False)

print("--Duration Top 10---------")
print(df.head(10))

print("--Duration Bottom 10------")
print(df.tail(10))
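
Sorting shows the extremes, but trends and correlation are easier to see in summary numbers. The sketch below is one possible approach on a tiny made-up table; the `Calories` column and all values are assumptions for illustration and are not taken from `data.csv`.

```python
import pandas as pd

# Hypothetical workout data, invented for this sketch
sample = pd.DataFrame({
    "Duration": [30, 45, 60, 90, 120],
    "Calories": [200, 290, 410, 600, 790],
})

# describe() reports count, mean, std, min, quartiles, and max for each numeric column
print(sample.describe())

# corr() returns the Pearson correlation between two columns, a value from -1 to 1
r = sample["Duration"].corr(sample["Calories"])
print("Correlation between Duration and Calories:", round(r, 3))
```

A value near 1 means the two columns tend to rise together. It does not show that one causes the other.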


# Hacks

- Create or find your own dataset in a JSON file. Integrating it with your PBL project earns kudos.
- Extract info from that dataset (ex. max, min, mean, median, mode, etc.) using Pandas functions
- Answer College Board practice problems for 2.3
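
As a starting point for the first two hacks, here is a minimal sketch that reads JSON records into a DataFrame and pulls out summary values. The JSON text, the `name` and `score` fields, and the `scores` variable are all hypothetical; with a real file you would pass its path to `pd.read_json` instead of the `io.StringIO` wrapper.

```python
import io

import pandas as pd

# Hypothetical JSON records standing in for your own dataset
json_text = """
[
  {"name": "A", "score": 90},
  {"name": "B", "score": 72},
  {"name": "C", "score": 85},
  {"name": "D", "score": 72}
]
"""

# StringIO lets read_json treat the string like a file
scores = pd.read_json(io.StringIO(json_text))

summary = {
    "max": scores["score"].max(),
    "min": scores["score"].min(),
    "mean": scores["score"].mean(),
    "median": scores["score"].median(),
    # mode() returns a Series because there can be more than one most common value
    "mode": scores["score"].mode().tolist(),
}

for label, value in summary.items():
    print(label, value)
```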

## Hack Helpers
> Here is how Mort started on this assignment, by asking ChatGPT: "Regarding Python Pandas, what are some data sets that would be good for learning pandas?"

- There are many data sets that are suitable for learning pandas, depending on your interests and the skills you want to develop. Here are some suggestions:

    - Titanic Dataset: This is a classic dataset for data analysis and machine learning, and is often used as an introduction to pandas. The dataset contains information about passengers on the Titanic, including their demographics, ticket class, and survival status.

    - Iris Dataset: This is another classic dataset that is often used in machine learning and data analysis courses. The dataset contains information about iris flowers, including their sepal length, sepal width, petal length, and petal width.

    - NBA Player Stats: This dataset contains information about NBA players, including their stats for various games, their positions, and their salaries. It is a great dataset for learning how to clean and manipulate data using pandas.

    - Housing Prices Dataset: This dataset contains information about housing prices in a particular city or region, including variables such as the number of bedrooms, square footage, and price. It is a great dataset for learning how to manipulate and analyze numerical data using pandas.

    - COVID-19 Dataset: This dataset contains information about the COVID-19 pandemic, including the number of cases, deaths, and recoveries in various countries and regions. It is a great dataset for learning how to manipulate and analyze time-series data using pandas.

    - World Bank Dataset: This dataset contains information about economic indicators from various countries, such as GDP, population, and poverty rates. It is a great dataset for learning how to manipulate and analyze large datasets using pandas.

- These are just a few examples of the many datasets that are suitable for learning pandas. The best dataset for you will depend on your interests and the specific skills you want to develop.


> Follow up question, "where can I find Titanic data set"

- The Titanic dataset is a popular dataset for data analysis and machine learning, and it can be found on various websites and data repositories. Here are a few places where you can find the Titanic dataset:

    - Kaggle: The Titanic dataset is available on Kaggle, which is a popular platform for data scientists and machine learning enthusiasts. You can download the dataset from the Kaggle website after creating an account.

    - UCI Machine Learning Repository: The Titanic dataset is also available on the UCI Machine Learning Repository, which is a collection of datasets that are commonly used for machine learning research. You can download the dataset from the UCI Machine Learning Repository website.

    - Seaborn library: If you have the Seaborn library installed in your Python environment, you can load the Titanic dataset directly from the library using the following code:

    ```python
    import seaborn as sns
    titanic_data = sns.load_dataset('titanic')
    ```

- From this, Mort was able to build a query after running `pip install seaborn`. Teachers are looking for a little analysis on this type of data.

In [None]:
import seaborn as sns

# Load the titanic dataset
td = sns.load_dataset('titanic')

print(td.columns) # titanic data set

print(td[['who', 'pclass', 'sex', 'age']]) # look at selected columns
