#  Unit 2.3 Extracting Information from Data, Pandas
> Data connections, trends, and correlation.  Pandas is introduced as it could be valuable for PBL, data validation, as well as understanding College Board Topics.
- toc: true
- categories: collegeboard


# Pandas and DataFrames
> In this lesson we will be exploring data analysis using Pandas.  

- College Board talks about ideas like 
    - Tools. "the ability to process data depends on users capabilities and their tools"
    - Combining Data.  "combine county data sets"
    - Status on Data. "determining the artist with the greatest attendance during a particular month"
    - Data poses challenge. "the need to clean data", "incomplete data"


- [From Pandas Overview](https://pandas.pydata.org/docs/getting_started/index.html) -- When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.


![DataFrame](images/table_dataframe.png)

In [1]:
'''Pandas is used to gather data sets through its DataFrames implementation'''
import pandas as pd 

# importing pandas

# Notes From Lesson 
- Pandas is a Python library that lets users load, explore, and manipulate tabular data through DataFrames
- You can load data from CSV files as well as JSON files
- You can display different views and summaries of the data using functions found in pandas

Examples:
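- df.mean() - finds the mean of the numeric columns (pass numeric_only=True when the table also has text columns)
- df.median() - finds the median of the numeric columns (same numeric_only=True note applies)
- df.max() - finds the maximum of each column
- df.min() - finds the minimum of each column

You can also print out only certain data, like I did in my code below.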
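
As a quick illustration of the functions listed above, here is a minimal sketch on a made-up table. The column names mirror the NBA data set used later, but the players and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical mini data set; the names and numbers are made up for illustration.
df = pd.DataFrame(
    {
        "player_name": ["Ana", "Ben", "Cy", "Dee"],
        "age": [19, 25, 31, 25],
        "pts": [4.0, 12.5, 21.0, 8.5],
    }
)

# numeric_only=True skips the text column so only numbers are summarized.
means = df.mean(numeric_only=True)
medians = df.median(numeric_only=True)
print(means)
print(medians)

# max() and min() on a single column each return one value.
print(df["pts"].max(), df["pts"].min())

# Print only certain data: a boolean mask keeps the rows where it is True.
high_scorers = df[df["pts"] > 10]
print(high_scorers[["player_name", "pts"]])
```

The mask `df["pts"] > 10` is a column of True/False values, and indexing the DataFrame with it keeps only the True rows.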


# CPT project integration guide (personal w/chatgpt)

- Read the CSV file using Pandas: Use the read_csv function in Pandas to read the CSV file and store it as a DataFrame.

- Convert the DataFrame to JSON: Use the to_json method of the DataFrame to convert it to JSON format. You can specify the orientation of the JSON output (for example records, index, columns, or values) depending on your needs.

- Create an HTML file: Create an HTML file and include a script tag for the JavaScript that will load the JSON data. The script can be written inline or kept in a separate .js file referenced by the src attribute. Pointing src directly at the JSON file does not make the data available to the page.

- Load the JSON data in the HTML file: Use the fetch function in JavaScript to request the JSON file from the server, then parse the response using its json method.

- Display the data in the HTML file: Once you have the JSON data, you can use JavaScript to display the data in the HTML page.
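
The first two steps (CSV to DataFrame to JSON) could look roughly like the sketch below. The file names and the two-row CSV are placeholders made up so the example runs on its own. The JavaScript fetch steps are not shown.

```python
import json
import os
import tempfile

import pandas as pd

# Write a tiny hypothetical CSV so the example runs without external files.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "players.csv")  # placeholder file name
with open(csv_path, "w") as f:
    f.write("player_name,pts\nAna,4.0\nBen,12.5\n")

# Step 1: read the CSV into a DataFrame.
df = pd.read_csv(csv_path)

# Step 2: convert to JSON. orient="records" gives a list with one object per row,
# which is usually the easiest shape to loop over in JavaScript.
json_text = df.to_json(orient="records")
print(json_text)

# Optionally save it to a file that the web page can request later.
json_path = os.path.join(tmp_dir, "players.json")  # placeholder file name
df.to_json(json_path, orient="records")

# Parse the text back to confirm it is valid JSON.
records = json.loads(json_text)
```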


![Image](../images/jsonthing.png)



# College Board prep

1Q: A researcher is analyzing data about students in a school district to determine whether there is a relationship between grade point average and number of absences. The researcher plans on compiling data from several sources to create a record for each student.

The researcher has access to a database with the following information about each student.

- Last name

- First name

- Grade level (9, 10, 11, or 12)

- Grade point average (on a 0.0 to 4.0 scale)

The researcher also has access to another database with the following information about each student.

- First name

- Last name

- Number of absences from school

- Number of late arrivals to school

Upon compiling the data, the researcher identifies a problem due to the fact that neither data source uses a unique ID number for each student. Which of the following best describes the problem caused by the lack of unique ID numbers?

1A: Students who have the same name may be confused with each other.
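
To see why the missing unique ID matters, here is a small sketch with invented student records. The column names are my own assumptions, not something from the question. Two different students share a name, and joining on names mixes them up.

```python
import pandas as pd

# Hypothetical records: two different students share the name "Sam Lee".
grades = pd.DataFrame(
    {
        "first_name": ["Sam", "Sam", "Ava"],
        "last_name": ["Lee", "Lee", "Cruz"],
        "grade_level": [9, 12, 10],
        "gpa": [3.9, 2.1, 3.4],
    }
)
attendance = pd.DataFrame(
    {
        "first_name": ["Sam", "Sam", "Ava"],
        "last_name": ["Lee", "Lee", "Cruz"],
        "absences": [1, 14, 3],
        "late_arrivals": [0, 6, 2],
    }
)

# Without a unique ID, names are the only columns the two tables share.
merged = grades.merge(attendance, on=["first_name", "last_name"])
print(merged)

# Each "Sam Lee" matches both attendance rows, so 3 students become 5 records.
print(len(grades), "students ->", len(merged), "merged rows")
```

With a student ID column in both tables, the merge could use that column instead and each student would match exactly one attendance row.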

2Q: A team of researchers wants to create a program to analyze the amount of pollution reported in roughly 3,000 counties across the United States. The program is intended to combine county data sets and then process the data. Which of the following is most likely to be a challenge in creating the program?

(A) A computer program cannot combine data from different files.

(B) Different counties may organize data in different ways.

(C) The number of counties is too large for the program to process.

(D) The total number of rows of data is too large for the program to process.

2A: B
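
Answer B can be sketched in pandas. The county names, column names, and units below are all hypothetical. The point is that the tables have to be put into one layout before they can be combined.

```python
import pandas as pd

# Hypothetical files from two counties that record the same facts differently.
county_a = pd.DataFrame({"county": ["Adams"], "pm25_ug_m3": [8.0]})
county_b = pd.DataFrame({"County Name": ["Baker"], "PM2.5 (mg/m3)": [0.012]})

# Rename the second county's columns and convert its units to match the first.
county_b_clean = county_b.rename(
    columns={"County Name": "county", "PM2.5 (mg/m3)": "pm25_ug_m3"}
)
county_b_clean["pm25_ug_m3"] = county_b_clean["pm25_ug_m3"] * 1000  # mg -> ug

# Once the layouts match, the tables can be stacked into one data set.
combined = pd.concat([county_a, county_b_clean], ignore_index=True)
print(combined)
```

Doing this for roughly 3,000 counties is the real challenge, since each different layout needs its own cleanup rule.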

3Q: A student is creating a Web site that is intended to display information about a city based on a city name that a user enters in a text field. Which of the following are likely to be challenges associated with processing city names that users might provide as input?

Select two answers.

(A) Users might attempt to use the Web site to search for multiple cities.

(B) Users might enter abbreviations for the names of cities.

(C) Users might misspell the name of the city.

(D) Users might be slow at typing a city name in the text field.

3A: B and C
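
Here is one possible way a program could cope with abbreviations and misspellings, using Python's standard difflib module. The city list, the abbreviation table, and the resolve_city helper are all made up for this sketch.

```python
import difflib

# Hypothetical lookup tables; a real site would have far more entries.
KNOWN_CITIES = ["San Diego", "Los Angeles", "San Francisco", "New York"]
ABBREVIATIONS = {
    "sd": "San Diego",
    "la": "Los Angeles",
    "sf": "San Francisco",
    "nyc": "New York",
}


def resolve_city(user_input):
    """Return the best-matching known city, or None if nothing is close."""
    text = user_input.strip().lower()
    # Challenge B: expand abbreviations first.
    if text in ABBREVIATIONS:
        return ABBREVIATIONS[text]
    # Challenge C: tolerate small misspellings with fuzzy matching.
    lowered = {city.lower(): city for city in KNOWN_CITIES}
    matches = difflib.get_close_matches(text, list(lowered), n=1, cutoff=0.8)
    return lowered[matches[0]] if matches else None


print(resolve_city("LA"))
print(resolve_city("San Deigo"))
print(resolve_city("Atlantis"))
```

Even with this kind of cleanup, some inputs stay ambiguous, which is why the question calls them challenges.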

4Q: A database of information about shows at a concert venue contains the following information.

- Name of artist performing at the show

- Date of show

- Total dollar amount of all tickets sold

Which of the following additional pieces of information would be most useful in determining the artist with the greatest attendance during a particular month?

(A) Average ticket price

(B) Length of the show in minutes

(C) Start time of the show

(D) Total dollar amount of food and drinks sold during the show

4A: A
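
The reasoning is that attendance is roughly total ticket sales divided by average ticket price. Below is a small sketch with invented shows. The artists, dates, and dollar amounts are hypothetical.

```python
import pandas as pd

# Hypothetical shows; artist names and amounts are invented for illustration.
shows = pd.DataFrame(
    {
        "artist": ["Artist A", "Artist B", "Artist A"],
        "date": pd.to_datetime(["2023-03-04", "2023-03-11", "2023-03-18"]),
        "ticket_sales": [50000.0, 90000.0, 30000.0],
        "avg_ticket_price": [50.0, 100.0, 60.0],
    }
)

# Attendance is not stored, but it can be estimated from the two dollar columns.
shows["est_attendance"] = shows["ticket_sales"] / shows["avg_ticket_price"]

# Keep one month, total the estimate per artist, and find the largest.
march = shows[shows["date"].dt.month == 3]
by_artist = march.groupby("artist")["est_attendance"].sum()
print(by_artist)
print("Greatest attendance:", by_artist.idxmax())
```

In this made-up data Artist B sold more dollars of tickets in a single show, but Artist A had more people attend, so ticket revenue alone would give the wrong answer.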

5Q: A camera mounted on the dashboard of a car captures an image of the view from the driver’s seat every second. Each image is stored as data. Along with each image, the camera also captures and stores the car’s speed, the date and time, and the car’s GPS location as metadata. Which of the following can best be determined using only the data and none of the metadata?

(A) The average number of hours per day that the car is in use

(B) The car’s average speed on a particular day

(C) The distance the car traveled on a particular day

(D) The number of bicycles the car passed on a particular day

5A: D

6Q: A teacher sends students an anonymous survey in order to learn more about the students’ work habits. The survey contains the following questions.

- On average, how long does homework take you each night (in minutes)?

- On average, how long do you study for each test (in minutes)?

- Do you enjoy the subject material of this class (yes or no)?

Which of the following questions about the students who responded to the survey can the teacher answer by analyzing the survey results?

I. Do students who enjoy the subject material tend to spend more time on homework each night than the other students do?

II. Do students who spend more time on homework each night tend to spend less time studying for tests than the other students do?

III. Do students who spend more time studying for tests tend to earn higher grades in the class than the other students do?

(A) I only

(B) III only

(C) I and II

(D) I and III

6A: C
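
To check the answer, here is a sketch of how questions I and II could be answered from survey results in pandas, and why III cannot. The responses and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical anonymous responses; the values are invented for illustration.
survey = pd.DataFrame(
    {
        "homework_min": [30, 45, 60, 20, 50],
        "study_min": [90, 60, 40, 120, 50],
        "enjoys_subject": ["yes", "no", "yes", "no", "yes"],
    }
)

# Question I: compare average homework time for the "yes" and "no" groups.
homework_by_enjoyment = survey.groupby("enjoys_subject")["homework_min"].mean()
print(homework_by_enjoyment)

# Question II: a negative correlation means more homework goes with less studying.
hw_vs_study = survey["homework_min"].corr(survey["study_min"])
print(round(hw_vs_study, 2))

# Question III needs grades, and the survey has no such column.
print("grade" in survey.columns)
```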

# NBA Player Dataset
> Here is a dataset that I found online, which contains the stats of NBA players

In [2]:
import pandas as pd

# Load the data set and sort by age, oldest first
df = pd.read_csv('files/NBA.csv').sort_values(by=['age'], ascending=False)


print("--FG Top 10---------")
# Because of the sort above, these are the 10 oldest player-seasons
print(df.head(10))

# Question for the teacher: df.head() displays when it is in this cell, but when I add df.tail() to the same cell, it doesn't show. I have to put it in a different cell for it to show.



--FG Top 10---------
       Unnamed: 0      player_name team_abbreviation   age  player_height  \
4698         4698     Kevin Willis               DAL  44.0         213.36   
270           270    Robert Parish               CHI  43.0         215.90   
10818       10818     Vince Carter               ATL  43.0         198.12   
5680         5680  Dikembe Mutombo               HOU  43.0         218.44   
4892         4892  Dikembe Mutombo               HOU  42.0         218.44   
12149       12149    Udonis Haslem               MIA  42.0         203.20   
10240       10240     Vince Carter               ATL  42.0         198.12   
3795         3795     Kevin Willis               ATL  42.0         213.36   
1291         1291    Herb Williams               NYK  41.0         210.82   
2818         2818    John Stockton               UTA  41.0         185.42   

       player_weight         college country draft_year draft_round  ...  \
4698      111.130040  Michigan State     USA       1984

In [3]:
import pandas as pd

df = pd.read_csv('files/NBA.csv').sort_values(by=['age'], ascending=False)


print("--bottom 10---------")
print(df.tail(10))

--bottom 10---------
       Unnamed: 0      player_name team_abbreviation   age  player_height  \
8488         8488    Bruno Caboclo               TOR  19.0         205.74   
8783         8783    Rashad Vaughn               MIL  19.0         198.12   
12194       12194     Joshua Primo               SAS  19.0         193.04   
4410         4410  Martell Webster               POR  19.0         200.66   
9206         9206  Marquese Chriss               PHX  19.0         208.28   
10523       10523    Kevin Knox II               NYK  19.0         205.74   
11829       11829   Jaden Springer               PHI  19.0         193.04   
342           342      Kobe Bryant               LAL  18.0         200.66   
78             78  Jermaine O'Neal               POR  18.0         210.82   
4138         4138     Andrew Bynum               LAL  18.0         213.36   

       player_weight     college country draft_year draft_round  ...   pts  \
8488       92.986360        None  Brazil       2014  
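
One possible explanation for the head/tail question in the cell above, assuming the calls were written as bare expressions: a notebook cell only auto-displays the value of its last line, so anything earlier has to be printed explicitly. A minimal sketch with a made-up table standing in for the NBA data:

```python
import pandas as pd

# Hypothetical stand-in for the NBA table, sorted oldest first like the cells above.
df = pd.DataFrame(
    {"player_name": ["A", "B", "C", "D"], "age": [40.0, 33.0, 25.0, 19.0]}
)

# A bare expression like df.head(2) is only echoed when it is the last line
# of a notebook cell, so print() both results to see them in the same cell.
oldest = df.head(2)
youngest = df.tail(2)
print(oldest)
print(youngest)
```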

In [4]:
print("------Here are the players with the highest average points------")
print(df[df.pts == df.pts.max()])
print()



print("------Here are the players with the lowest average points------")
print(df[df.pts == df.pts.min()])
print()




------Here are the players with the highest average points------
       Unnamed: 0   player_name team_abbreviation   age  player_height  \
10572       10572  James Harden               HOU  29.0         195.58   

       player_weight        college country draft_year draft_round  ...   pts  \
10572       99.79024  Arizona State     USA       2009           1  ...  36.1   

       reb  ast  net_rating  oreb_pct  dreb_pct  usg_pct  ts_pct  ast_pct  \
10572  6.6  7.5         6.3     0.023     0.157    0.396   0.616    0.394   

        season  
10572  2018-19  

[1 rows x 22 columns]

------Here are the players with the lowest average points------
       Unnamed: 0               player_name team_abbreviation   age  \
3111         3111            Olden Polynice               LAC  39.0   
2676         2676              Chris Dudley               POR  38.0   
2090         2090             Muggsy Bogues               TOR  36.0   
6721         6721             Brian Skinner               MEM 

In [5]:
print("Average overall stats of the typical player in the NBA")

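# numeric_only=True skips text columns such as player_name and college;
# without it, pandas 2.0 and later raise a TypeError on this table
df.mean(numeric_only=True)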



Average overall stats of the typical player in the NBA




Unnamed: 0       6152.000000
age                27.084518
player_height     200.611602
player_weight     100.369926
gp                 51.290532
pts                 8.172775
reb                 3.559155
ast                 1.813986
net_rating         -2.255733
oreb_pct            0.054473
dreb_pct            0.141014
usg_pct             0.184891
ts_pct              0.511060
ast_pct             0.131358
dtype: float64

In [6]:
print("------Here are the players with their average number of points over 20------")

print(df[df.pts > 20.00])


------Here are the players with their average number of points over 20------
       Unnamed: 0      player_name team_abbreviation   age  player_height  \
2995         2995      Karl Malone               UTA  39.0         205.74   
2199         2199   Michael Jordan               WAS  39.0         198.12   
2276         2276      Karl Malone               UTA  38.0         205.74   
2079         2079      Karl Malone               UTA  37.0         205.74   
12175       12175     LeBron James               LAL  37.0         205.74   
...           ...              ...               ...   ...            ...   
11365       11365  Zion Williamson               NOP  20.0         200.66   
10451       10451      Luka Doncic               DAL  20.0         200.66   
3120         3120     LeBron James               CLE  19.0         203.20   
10799       10799  Zion Williamson               NOP  19.0         198.12   
5208         5208     Kevin Durant               SEA  19.0         205.74   

In [7]:
print("------These are the median stats by the players in the NBA ------")
# numeric_only=True skips text columns such as player_name and college
df.median(numeric_only=True)



------These are the median stats by the players in the NBA ------




Unnamed: 0       6152.00000
age                26.00000
player_height     200.66000
player_weight      99.79024
gp                 57.00000
pts                 6.70000
reb                 3.00000
ast                 1.20000
net_rating         -1.30000
oreb_pct            0.04100
dreb_pct            0.13100
usg_pct             0.18100
ts_pct              0.52400
ast_pct             0.10300
dtype: float64