## Lab Grading Summary

- Your Grade : 5
- Graded By  : Professor Nosky

###  Feedback to Learner  



1. Code Complete means all coding sections have an honest attempt to code the problem at hand. Please note this does not imply the code is correct, but it should be. Reminder: the lab advice has the answers, so there is no reason why you don't have something that works. If your code is not correct, I expect it to look like the video and include an adequate reflection in the metacognition, along with the research you did to solve it and a question.
2. Cells Executed means ALL code cells in the lab display evidence they were executed in your lab submission. (This includes running the checker if one is available at the end of your lab.) Reminder: If you do a kernel reset, you will have to rerun all cells top to bottom.
3. Reflection/Metacognition Complete means you made an honest effort to assess your level of understanding and followed it up with answers at the end of the lab adequately conveying what you have learned and what still confuses you. This should be evident in the work you have done to complete the lab. If something doesn't work, be very clear about your understanding of why it didn't work. Don't just include the error message; tell me what the message means as it relates to your code. Code that doesn't work with a cognition level of 3 or 4 makes no sense.
 

### Rubric

- All 3 criteria met ==> 5
- 2 criteria met ==> 3
- 1 criterion met ==> 1
- No criteria met, and all late work ==> 0



# In-Class Coding Lab: Transformations with Pandas

This lab will explore some **Stocks** data as retrieved from **Yahoo Finance** in March of 2024. All of the data you will need can be found in the `stocks` folder where you found this lab.

The emphasis of this lab is not data analysis per se but instead how to deal with complex data sets, specifically:

 - reading data in JSON format
 - scraping HTML table data from the web
 - combining data sets using `concat()`
 - connecting data sets on a common column using `merge()`
 - custom operations using `apply()`


In [5]:
import pandas as pd
import numpy as np
import json
from IPython.display import display
# this turns off warning messages
import warnings
warnings.filterwarnings('ignore')

## Reading in JSON data

The preferred method of reading JSON data into a Pandas DataFrame is to deserialize the data with the `json` library and then use `pd.json_normalize()` to further process the data. As we saw in the reading for this week, `json_normalize()` is quite powerful for handling the JSON format and has many options.

If you observe the `stocks/company-info.json` file, you will see the JSON is *nested*. For example the `city` key is under the `info` key.

```
[
    {
        "symbol": "X",
        "name": "United States Steel Corporation",
        "exchange": "NYQ",
        "industry": "Steel",
        "sector": "Basic Materials",
        "info": {
            "website": "https://www.ussteel.com",
            "city": "Pittsburgh",
            "state": "PA",
            "country": "United States"
        }
    },
    ...
```

`json_normalize()` can handle nested JSON easily. 

### Why is nested JSON a problem?

Run this code to read in the `company-info` file:

In [6]:
companies = pd.read_json("stocks/company-info.json")
companies[['info']].head()

Unnamed: 0,info
0,"{'website': 'https://www.ussteel.com', 'city':..."
1,"{'website': 'https://www.gm.com', 'city': 'Det..."
2,"{'website': 'https://www.apple.com', 'city': '..."
3,"{'website': 'https://www.aboutamazon.com', 'ci..."
4,"{'website': 'https://investor.fb.com', 'city':..."


In [7]:
companies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   symbol    15 non-null     object
 1   name      15 non-null     object
 2   exchange  15 non-null     object
 3   industry  15 non-null     object
 4   sector    15 non-null     object
 5   info      15 non-null     object
dtypes: object(6)
memory usage: 852.0+ bytes


See the problem here? The `info` key in the JSON has 4 key-value pairs, but `read_json()` does not flatten nested JSON: each `info` value is stored whole, as a single dictionary in one column.

This means the values `website`, `city`, `state` and `country` are not accessible as columns. :-(

### json_normalize() to the rescue!

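By default `json_normalize()` will flatten the schema. It takes an extra step because `json_normalize()` works on deserialized Python objects (lists and dictionaries) rather than reading from a file, so we first load the file with `json.load()`.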


In [8]:
with open("stocks/company-info.json", "r") as f:
    data = json.load(f)

companies = pd.json_normalize(data)
companies.head()

Unnamed: 0,symbol,name,exchange,industry,sector,info.website,info.city,info.state,info.country
0,X,United States Steel Corporation,NYQ,Steel,Basic Materials,https://www.ussteel.com,Pittsburgh,PA,United States
1,GM,General Motors Company,NYQ,Auto Manufacturers,Consumer Cyclical,https://www.gm.com,Detroit,MI,United States
2,AAPL,Apple Inc.,NMS,Consumer Electronics,Technology,https://www.apple.com,Cupertino,CA,United States
3,AMZN,"Amazon.com, Inc.",NMS,Internet Retail,Consumer Cyclical,https://www.aboutamazon.com,Seattle,WA,United States
4,META,"Meta Platforms, Inc.",NMS,Internet Content & Information,Communication Services,https://investor.fb.com,Menlo Park,CA,United States


In [9]:
companies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   symbol        15 non-null     object
 1   name          15 non-null     object
 2   exchange      15 non-null     object
 3   industry      15 non-null     object
 4   sector        15 non-null     object
 5   info.website  15 non-null     object
 6   info.city     15 non-null     object
 7   info.state    15 non-null     object
 8   info.country  15 non-null     object
dtypes: object(9)
memory usage: 1.2+ KB
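
As a small self-contained illustration of the flattening, the sketch below normalizes a two-record list shaped like the company file. The records are abbreviated copies of the data above, and the `sep="_"` argument is an optional choice (not used in this lab) that produces column names such as `info_city` instead of `info.city`.

```python
import pandas as pd

# Two abbreviated records shaped like stocks/company-info.json
sample = [
    {"symbol": "X", "name": "United States Steel Corporation",
     "info": {"city": "Pittsburgh", "state": "PA"}},
    {"symbol": "AAPL", "name": "Apple Inc.",
     "info": {"city": "Cupertino", "state": "CA"}},
]

# Default: nested keys are joined to the parent key with a dot
flat = pd.json_normalize(sample)
print(list(flat.columns))

# Optional: join with an underscore instead of a dot
flat_us = pd.json_normalize(sample, sep="_")
print(list(flat_us.columns))
```

Either naming style works with bracket access such as `flat["info.city"]`; the underscore form may be handier if you prefer attribute-style access.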


### 1.1 You Code

To demonstrate the nested values are available, use pandas filters to display these columns:

    - symbol
    - name
    - info.state
    
for only those companies in California, using `info.state == 'CA'` as the boolean index.

Place the results in a separate dataframe variable and then display it.

In [10]:
# todo write code here

ca_companies = companies[companies['info.state'] == 'CA']
ca_companies_filtered = ca_companies[['symbol', 'name', 'info.state']]
display(ca_companies_filtered)


Unnamed: 0,symbol,name,info.state
2,AAPL,Apple Inc.,CA
4,META,"Meta Platforms, Inc.",CA
5,GOOG,Alphabet Inc.,CA
6,TTD,"The Trade Desk, Inc.",CA
10,NET,"Cloudflare, Inc.",CA
11,NFLX,"Netflix, Inc.",CA
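
The same row filter and column selection can also be written as one `.loc[]` expression. This is a sketch on a tiny stand-in for `companies` (three rows copied from the outputs above, under the assumed name `companies_demo`), not a replacement for the two-step answer.

```python
import pandas as pd

# Tiny stand-in for the normalized `companies` DataFrame
companies_demo = pd.DataFrame({
    "symbol": ["X", "AAPL", "META"],
    "name": ["United States Steel Corporation", "Apple Inc.", "Meta Platforms, Inc."],
    "info.state": ["PA", "CA", "CA"],
})

# Rows: boolean index on the state; columns: an explicit list
is_ca = companies_demo["info.state"] == "CA"
ca_demo = companies_demo.loc[is_ca, ["symbol", "name", "info.state"]]
print(ca_demo)
```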


## Simple web scraping with Pandas

The pandas `read_html(url)` function allows us to read all the HTML tables on the webpage at the provided `url`. This is a quick and easy method of *web scraping* (parsing content from the web).

`read_html()` will return a list of every HTML table on the page. It's then up to us to figure out which one in the list is the one we want. 


### Example:

For example, visit this page in your web browser: [https://en.wikipedia.org/wiki/Display_resolution](https://en.wikipedia.org/wiki/Display_resolution)

About halfway down the page, there is a section titled **Common Display Resolutions** and within this section there is a data table. Let's capture this table in Pandas using code.

This code will read every table on the webpage, making a Python `list`:


In [11]:
tables = pd.read_html("https://en.wikipedia.org/wiki/Display_resolution")

Let's iterate over the tables printing the index and the table itself. This makes it easier to find the table we want from the webpage. To get the index while we loop, we use the `enumerate()` function which returns the item and its index.

In [12]:
for index, table in enumerate(tables):
    print("INDEX:", index)
    print("TABLE:")
    display(table.head(5))

INDEX: 0
TABLE:


Unnamed: 0,0,1
0,,This section does not cite any sources. Please...


INDEX: 1
TABLE:


Unnamed: 0,0,1
0,,This section does not cite any sources. Please...


INDEX: 2
TABLE:


Unnamed: 0,Standard,Aspect ratio,Width (px),Height (px),Megapixels,Steam[6] (%),StatCounter[7] (%)
0,nHD,16:9,640.0,360.0,0.23,,0.47
1,VGA,4:3,640.0,480.0,0.307,,
2,SVGA,4:3,800.0,600.0,0.48,,0.76
3,XGA,4:3,1024.0,768.0,0.786,0.38,2.78
4,WXGA,16:9,1280.0,720.0,0.922,0.36,4.82


INDEX: 3
TABLE:


Unnamed: 0,vteComputer display standards,vteComputer display standards.1
0,PC-compatible video hardware,MDA (1981) CGA (1981) HGC (1982) Plantronics (...
1,Standard display resolutions,160×120 320×200 640×200 640×350 640×480 720×34...
2,Widescreen display resolutions,240×160 320×240 432×240 480×270 480×320 640×40...


INDEX: 4
TABLE:


Unnamed: 0,vteData compression methods,vteData compression methods.1
0,Lossless,Entropy type Adaptive coding Arithmetic Asymme...
1,Entropy type,Adaptive coding Arithmetic Asymmetric numeral ...
2,Dictionary type,Byte pair encoding Lempel–Ziv 842 LZ4 LZJB LZO...
3,Other types,BWT CTW CM Delta Incremental DMC DPCM Grammar ...
4,Hybrid,LZ77 + Huffman Deflate LZX LZS LZ77 + ANS LZFS...


INDEX: 5
TABLE:


Unnamed: 0,0,1
0,Entropy type,Adaptive coding Arithmetic Asymmetric numeral ...
1,Dictionary type,Byte pair encoding Lempel–Ziv 842 LZ4 LZJB LZO...
2,Other types,BWT CTW CM Delta Incremental DMC DPCM Grammar ...
3,Hybrid,LZ77 + Huffman Deflate LZX LZS LZ77 + ANS LZFS...


INDEX: 6
TABLE:


Unnamed: 0,0,1
0,Transform type,Discrete cosine transform DCT MDCT DST FFT Wav...
1,Predictive type,DPCM ADPCM LPC ACELP CELP LAR LSP WLPC Motion ...


INDEX: 7
TABLE:


Unnamed: 0,0,1
0,Concepts,Bit rate ABR CBR VBR Companding Convolution Dy...
1,Codec parts,A-law μ-law DPCM ADPCM DM FT FFT LPC ACELP CEL...


INDEX: 8
TABLE:


Unnamed: 0,0,1
0,Concepts,Chroma subsampling Coding tree unit Color spac...
1,Methods,Chain code DCT Deflate Fractal KLT LP RLE Wave...


INDEX: 9
TABLE:


Unnamed: 0,0,1
0,Concepts,Bit rate ABR CBR VBR Display resolution Frame ...
1,Codec parts,DCT DPCM Deblocking filter Lapped transform Mo...


INDEX: 10
TABLE:


Unnamed: 0,vteDigital video resolutions,vteDigital video resolutions.1,Unnamed: 2,Unnamed: 3
0,Designation,Usage examples Definition (lines) Rate (Hz) ...,,
1,Usage examples,Definition (lines),Rate (Hz),Rate (Hz)
2,Usage examples,Definition (lines),Interlaced (fields),Progressive (frames)
3,"Low, MP@LL","LDTV, VCD, HTV 240, 288 (SIF) 24, 30; 25",,
4,"LDTV, VCD, HTV","240, 288 (SIF)",,"24, 30; 25"


INDEX: 11
TABLE:


Unnamed: 0_level_0,Usage examples,Definition (lines),Rate (Hz),Rate (Hz)
Unnamed: 0_level_1,Usage examples,Definition (lines),Interlaced (fields),Progressive (frames)


INDEX: 12
TABLE:


Unnamed: 0,0,1,2,3
0,"LDTV, VCD, HTV","240, 288 (SIF)",,"24, 30; 25"


INDEX: 13
TABLE:


Unnamed: 0,0,1,2,3
0,"SDTV, SVCD, DVD, DV","480 (NTSC), 576 (PAL/SECAM)",60; 50,"24, 30; 25"
1,"SDTV, SVCD, DVD, DV",,,


INDEX: 14
TABLE:


Unnamed: 0,0,1,2,3
0,EDTV,"480, 540 (NTSC-HQ), 576 (PAL-HQ)",,"24, 30; 25"


INDEX: 15
TABLE:


Unnamed: 0,0,1,2,3
0,"HDTV, BD, HD DVD, HDV",720,,"24, 30, 60; 25, 50"
1,"HDTV, BD, HD DVD, HDV","1080, 1440",60; 50,"24, 30, 60; 25, 50"


INDEX: 16
TABLE:


Unnamed: 0,0,1,2,3
0,"UHDTV, UHD BRD","2160, 4320",,"60, 120, 180"


### 17 tables?!?!?

That's a lot of tables, but it looks like the table at `index == 2` is the one we want!


In [13]:
resolutions = tables[2]
display(resolutions.head(n=5))

Unnamed: 0,Standard,Aspect ratio,Width (px),Height (px),Megapixels,Steam[6] (%),StatCounter[7] (%)
0,nHD,16:9,640.0,360.0,0.23,,0.47
1,VGA,4:3,640.0,480.0,0.307,,
2,SVGA,4:3,800.0,600.0,0.48,,0.76
3,XGA,4:3,1024.0,768.0,0.786,0.38,2.78
4,WXGA,16:9,1280.0,720.0,0.922,0.36,4.82


Now that we have "discovered" where the table we want is located, we can tidy our code up as:

In [14]:
tables = pd.read_html("https://en.wikipedia.org/wiki/Display_resolution")
# we discovered it's at index 2
resolutions = tables[2]
display(resolutions.head(n=5))

Unnamed: 0,Standard,Aspect ratio,Width (px),Height (px),Megapixels,Steam[6] (%),StatCounter[7] (%)
0,nHD,16:9,640.0,360.0,0.23,,0.47
1,VGA,4:3,640.0,480.0,0.307,,
2,SVGA,4:3,800.0,600.0,0.48,,0.76
3,XGA,4:3,1024.0,768.0,0.786,0.38,2.78
4,WXGA,16:9,1280.0,720.0,0.922,0.36,4.82
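
A hard-coded index is fragile, because the position of a table can shift whenever the web page is edited. One possible safeguard, sketched here with a hypothetical helper `find_table()` (not part of pandas), is to search the list returned by `read_html()` for the table with the expected column names. The demo uses small hand-made DataFrames in place of a live download.

```python
import pandas as pd

def find_table(tables: list, required_columns: list) -> pd.DataFrame:
    """Return the first DataFrame that contains all required columns.

    Hypothetical helper for illustration; raises ValueError if none match.
    """
    for table in tables:
        # Column labels may not be strings (e.g. 0, 1), so convert first
        names = {str(col) for col in table.columns}
        if set(required_columns) <= names:
            return table
    raise ValueError(f"No table has columns {required_columns}")

# Stand-ins for the list that pd.read_html() would return
demo_tables = [
    pd.DataFrame({0: ["Concepts"], 1: ["Bit rate ABR CBR VBR"]}),
    pd.DataFrame({"Standard": ["nHD", "VGA"], "Aspect ratio": ["16:9", "4:3"]}),
]

resolutions_demo = find_table(demo_tables, ["Standard", "Aspect ratio"])
print(resolutions_demo)
```

`read_html()` also accepts a `match` argument that keeps only the tables containing a given text, which may be another way to narrow the list.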


### 1.2 You Code 

Write code to extract the **S&P 500 component stocks** table from this webpage:   

`https://en.wikipedia.org/wiki/List_of_S%26P_500_companies` [https://en.wikipedia.org/wiki/List_of_S%26P_500_companies](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies)

TIP: Use the cell above this one to "figure it out" and once you know the exact code, place it in the cell below. Name the DataFrame variable `sandp`, and use the `display()` function to show a random `sample()` of 10 companies.

In [15]:
# todo write code here

tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
sandp = tables[0]
display(sandp.sample(10))

# another way to view it: the first 10 rows instead of a random sample
display(sandp.head(n=10))

Unnamed: 0,Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
452,TRV,Travelers Companies (The),Financials,Property & Casualty Insurance,"New York City, New York",2002-08-21,86312,1853
198,FE,FirstEnergy,Utilities,Electric Utilities,"Akron, Ohio",1997-11-28,1031296,1997
205,FOX,Fox Corporation (Class B),Communication Services,Broadcasting,"New York City, New York",2019-03-04,1754301,2019
446,TXT,Textron,Industrials,Aerospace & Defense,"Providence, Rhode Island",1978-12-31,217346,1923
443,TER,Teradyne,Information Technology,Semiconductor Materials & Equipment,"North Reading, Massachusetts",2020-09-21,97210,1960
464,URI,United Rentals,Industrials,Trading Companies & Distributors,"Stamford, Connecticut",2014-09-20,1067701,1997
379,PPL,PPL Corporation,Utilities,Electric Utilities,"Allentown, Pennsylvania",2001-10-01,922224,1920
56,BALL,Ball Corporation,Materials,"Metal, Glass & Plastic Containers","Broomfield, Colorado",1984-10-31,9389,1880
64,BIO,Bio-Rad,Health Care,Life Sciences Tools & Services,"Hercules, California",2020-06-22,12208,1952
447,TMO,Thermo Fisher Scientific,Health Care,Life Sciences Tools & Services,"Waltham, Massachusetts",2004-08-03,97745,2006 (1902)


Unnamed: 0,Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
0,MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,66740,1902
1,AOS,A. O. Smith,Industrials,Building Products,"Milwaukee, Wisconsin",2017-07-26,91142,1916
2,ABT,Abbott,Health Care,Health Care Equipment,"North Chicago, Illinois",1957-03-04,1800,1888
3,ABBV,AbbVie,Health Care,Biotechnology,"North Chicago, Illinois",2012-12-31,1551152,2013 (1888)
4,ACN,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989
5,ADBE,Adobe Inc.,Information Technology,Application Software,"San Jose, California",1997-05-05,796343,1982
6,AMD,Advanced Micro Devices,Information Technology,Semiconductors,"Santa Clara, California",2017-03-20,2488,1969
7,AES,AES Corporation,Utilities,Independent Power Producers & Energy Traders,"Arlington, Virginia",1998-10-02,874761,1981
8,AFL,Aflac,Financials,Life & Health Insurance,"Columbus, Georgia",1999-05-28,4977,1955
9,A,Agilent Technologies,Health Care,Life Sciences Tools & Services,"Santa Clara, California",2000-06-05,1090872,1999


## Merging two DataFrames together on a common/matching column.

Right now we have 2 DataFrames:

`companies` - our list of companies.  
`sandp` - the companies on the S&P 500 index

In [16]:
companies.sort_values("symbol").head(5)

Unnamed: 0,symbol,name,exchange,industry,sector,info.website,info.city,info.state,info.country
2,AAPL,Apple Inc.,NMS,Consumer Electronics,Technology,https://www.apple.com,Cupertino,CA,United States
3,AMZN,"Amazon.com, Inc.",NMS,Internet Retail,Consumer Cyclical,https://www.aboutamazon.com,Seattle,WA,United States
7,DELL,Dell Technologies Inc.,NYQ,Computer Hardware,Technology,https://www.delltechnologies.com,Round Rock,TX,United States
1,GM,General Motors Company,NYQ,Auto Manufacturers,Consumer Cyclical,https://www.gm.com,Detroit,MI,United States
5,GOOG,Alphabet Inc.,NMS,Internet Content & Information,Communication Services,https://abc.xyz,Mountain View,CA,United States


In [17]:
sandp.sort_values("Symbol").head(5)

Unnamed: 0,Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
9,A,Agilent Technologies,Health Care,Life Sciences Tools & Services,"Santa Clara, California",2000-06-05,1090872,1999
25,AAL,American Airlines Group,Industrials,Passenger Airlines,"Fort Worth, Texas",2015-03-23,6201,1934
39,AAPL,Apple Inc.,Information Technology,"Technology Hardware, Storage & Peripherals","Cupertino, California",1982-11-30,320193,1977
3,ABBV,AbbVie,Health Care,Biotechnology,"North Chicago, Illinois",2012-12-31,1551152,2013 (1888)
11,ABNB,Airbnb,Consumer Discretionary,"Hotels, Resorts & Cruise Lines","San Francisco, California",2023-09-18,1559720,2008


You can see that `AAPL` is on both our company list and the S&P 500 company list. It's great to observe that, but even better to output it programmatically with code.

### Join types

For two datasets, in this case:

```
+===========+                 +===========+
| companies |                 |   sandp   |
+===========+                 +===========+
|   Our     |                 |  S and P  |
| Companies |                 | 500 Index |
+-----------+                 +-----------+
| column:   |                 | column:   | 
|   symbol  |                 |  Symbol   | 
+-----------+                 +-----------+
```
Consider `companies` on the left and `sandp` on the right. Left and right are relative but we need some kind of positioning for reference.

Here are the 3 join possibilities `left`, `inner` and `right` along with their results.

```
+===========+  +===========+  +===========+
| how:left  |  | how:inner |  | how:right |
+===========+  +===========+  +===========+
| RESULTS:  |  | RESULTS:  |  | RESULTS:  |
| inner +   |  | only rows |  | inner +   |
| all rows  |  | IN BOTH   |  | all rows  |
|  from     |  | companies |  |  from     |
| companies |  | AND sandp |  | sandp     |
+-----------+  +-----------+  +-----------+

```

So in summary:

- `how='inner'` ==> the resulting DataFrame contains only matches from the `left` and `right`
- `how='left'` ==> the resulting DataFrame contains all of the `left` + matches from the `left` and `right`
- `how='right'` ==> the resulting DataFrame contains all of the `right` + matches from the `left` and `right`
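
To see the three join types side by side, here is a minimal sketch on two tiny made-up DataFrames (the symbols are taken from the tables above, but the frames are illustrative stand-ins, not the lab data). The three `how` values behave like a SQL `INNER JOIN`, `LEFT JOIN` and `RIGHT JOIN` respectively.

```python
import pandas as pd

left = pd.DataFrame({"symbol": ["AAPL", "DELL", "GM"]})
right = pd.DataFrame({"Symbol": ["AAPL", "GM", "MMM"],
                      "Security": ["Apple Inc.", "General Motors", "3M"]})

results = {}
for how in ["inner", "left", "right"]:
    # Same key columns every time; only the join type changes
    results[how] = pd.merge(left, right, how=how,
                            left_on="symbol", right_on="Symbol")
    print(f"how={how!r}: {len(results[how])} rows")
    print(results[how], end="\n\n")
```

In the `left` result, `DELL` has missing values in the columns that come from `right`; in the `right` result, `MMM` has a missing `symbol`.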

### Which companies are not on the S&P 500?

Together let's figure out which `companies` are NOT on the `sandp`.

This is a two step process:

1. `merge()` the dataframes together using `how='left'`. Because we said `left`, the results will include matches where `companies['symbol'] == sandp['Symbol']`, in addition to all the rows from `companies` (because it's on the left).
2. Keep only the rows where `joined['Symbol'].isna()`, because if it's `np.nan` that means there was no match.

And what remains are companies that are NOT on the S&P 500!!!

In [18]:
# first perform the join
joined = pd.merge(left=companies, right=sandp, how="left", left_on="symbol", right_on="Symbol")

# second filter out any of the matches
not_on_sandp = joined[joined["Symbol"].isna()]
not_on_sandp

Unnamed: 0,symbol,name,exchange,industry,sector,info.website,info.city,info.state,info.country,Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
0,X,United States Steel Corporation,NYQ,Steel,Basic Materials,https://www.ussteel.com,Pittsburgh,PA,United States,,,,,,,,
6,TTD,"The Trade Desk, Inc.",NGM,Software - Application,Technology,https://www.thetradedesk.com,Ventura,CA,United States,,,,,,,,
7,DELL,Dell Technologies Inc.,NYQ,Computer Hardware,Technology,https://www.delltechnologies.com,Round Rock,TX,United States,,,,,,,,
10,NET,"Cloudflare, Inc.",NYQ,Software - Infrastructure,Technology,https://www.cloudflare.com,San Francisco,CA,United States,,,,,,,,


### 1.3 You Code

Now you try it. Use the `merge()` function to join `companies` to `sandp`, but this time only show matches. If you use a different `how`, you can complete this in a single step.

Save the results in a `matched` dataframe and `display()` it.

In [19]:
# todo write code here
matched = pd.merge(left=companies, right=sandp, how="inner", left_on="symbol", right_on="Symbol")
display(matched)


Unnamed: 0,symbol,name,exchange,industry,sector,info.website,info.city,info.state,info.country,Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
0,GM,General Motors Company,NYQ,Auto Manufacturers,Consumer Cyclical,https://www.gm.com,Detroit,MI,United States,GM,General Motors,Consumer Discretionary,Automobile Manufacturers,"Detroit, Michigan",2013-06-06,1467858,1908
1,AAPL,Apple Inc.,NMS,Consumer Electronics,Technology,https://www.apple.com,Cupertino,CA,United States,AAPL,Apple Inc.,Information Technology,"Technology Hardware, Storage & Peripherals","Cupertino, California",1982-11-30,320193,1977
2,AMZN,"Amazon.com, Inc.",NMS,Internet Retail,Consumer Cyclical,https://www.aboutamazon.com,Seattle,WA,United States,AMZN,Amazon,Consumer Discretionary,Broadline Retail,"Seattle, Washington",2005-11-18,1018724,1994
3,META,"Meta Platforms, Inc.",NMS,Internet Content & Information,Communication Services,https://investor.fb.com,Menlo Park,CA,United States,META,Meta Platforms,Communication Services,Interactive Media & Services,"Menlo Park, California",2013-12-23,1326801,2004
4,GOOG,Alphabet Inc.,NMS,Internet Content & Information,Communication Services,https://abc.xyz,Mountain View,CA,United States,GOOG,Alphabet Inc. (Class C),Communication Services,Interactive Media & Services,"Mountain View, California",2006-04-03,1652044,1998
5,IBM,International Business Machines,NYQ,Information Technology Services,Technology,https://www.ibm.com,Armonk,NY,United States,IBM,IBM,Information Technology,IT Consulting & Other Services,"Armonk, New York",1957-03-04,51143,1911
6,MSFT,Microsoft Corporation,NMS,Software - Infrastructure,Technology,https://www.microsoft.com,Redmond,WA,United States,MSFT,Microsoft,Information Technology,Systems Software,"Redmond, Washington",1994-06-01,789019,1975
7,NFLX,"Netflix, Inc.",NMS,Entertainment,Communication Services,https://www.netflix.com,Los Gatos,CA,United States,NFLX,Netflix,Communication Services,Movies & Entertainment,"Los Gatos, California",2010-12-20,1065280,1997
8,TSLA,"Tesla, Inc.",NMS,Auto Manufacturers,Consumer Cyclical,https://www.tesla.com,Austin,TX,United States,TSLA,"Tesla, Inc.",Consumer Discretionary,Automobile Manufacturers,"Austin, Texas",2020-12-21,1318605,2003
9,HD,"Home Depot, Inc. (The)",NYQ,Home Improvement Retail,Consumer Cyclical,https://www.homedepot.com,Atlanta,GA,United States,HD,Home Depot (The),Consumer Discretionary,Home Improvement Retail,"Atlanta, Georgia",1988-03-31,354950,1978


## Combining DataFrames by row.

We can use the `concat()` function to combine rows of multiple dataframes into a single dataframe with more rows.

For example, if you `concat()` three dataframes with 10, 15 and 20 rows, the resulting dataframe will have 10+15+20 == 45 rows.

In this example we read in the stock history for 3 companies and combine them into one dataframe.

In [20]:
microsoft = pd.read_csv("stocks/MSFT.csv")
microsoft

Unnamed: 0,Date,Symbol,Open,High,Low,Close,Volume
0,2024-03-18,MSFT,414.25,420.730011,413.779999,417.320007,20106000
1,2024-03-19,MSFT,417.829987,421.670013,415.549988,421.410004,19837900
2,2024-03-20,MSFT,422.0,425.959991,420.660004,425.230011,17860100
3,2024-03-21,MSFT,429.829987,430.820007,427.160004,429.369995,21296200
4,2024-03-22,MSFT,429.700012,429.859985,426.070007,428.73999,17636500


In [21]:
google = pd.read_csv("stocks/GOOG.csv")
google

Unnamed: 0,Date,Symbol,Open,High,Low,Close,Volume
0,2024-03-18,GOOG,149.369995,152.929993,148.139999,148.479996,47676700
1,2024-03-19,GOOG,148.979996,149.619995,147.009995,147.919998,17748400
2,2024-03-20,GOOG,148.789993,149.759995,147.664993,149.679993,17730000
3,2024-03-21,GOOG,150.320007,151.304993,148.009995,148.740005,19843900
4,2024-03-22,GOOG,150.240005,152.559998,150.089996,151.770004,19226300


In [22]:
apple = pd.read_csv("stocks/AAPL.csv")
apple

Unnamed: 0,Date,Symbol,Open,High,Low,Close,Volume
0,2024-03-18,AAPL,175.570007,177.710007,173.520004,173.720001,75604200
1,2024-03-19,AAPL,174.339996,176.610001,173.029999,176.080002,55215200
2,2024-03-20,AAPL,175.720001,178.669998,175.089996,178.669998,53423100
3,2024-03-21,AAPL,177.050003,177.490005,170.839996,171.369995,106181300
4,2024-03-22,AAPL,171.759995,173.050003,170.059998,172.279999,71106600


In [23]:
combined = pd.concat([microsoft, google, apple], ignore_index=True)
combined

Unnamed: 0,Date,Symbol,Open,High,Low,Close,Volume
0,2024-03-18,MSFT,414.25,420.730011,413.779999,417.320007,20106000
1,2024-03-19,MSFT,417.829987,421.670013,415.549988,421.410004,19837900
2,2024-03-20,MSFT,422.0,425.959991,420.660004,425.230011,17860100
3,2024-03-21,MSFT,429.829987,430.820007,427.160004,429.369995,21296200
4,2024-03-22,MSFT,429.700012,429.859985,426.070007,428.73999,17636500
5,2024-03-18,GOOG,149.369995,152.929993,148.139999,148.479996,47676700
6,2024-03-19,GOOG,148.979996,149.619995,147.009995,147.919998,17748400
7,2024-03-20,GOOG,148.789993,149.759995,147.664993,149.679993,17730000
8,2024-03-21,GOOG,150.320007,151.304993,148.009995,148.740005,19843900
9,2024-03-22,GOOG,150.240005,152.559998,150.089996,151.770004,19226300


Notice that to use `concat()`, the dataframes to combine must be passed in a list.

### 1.4 You Code

Let's make the previous example more efficient by using a loop. Most of this code has been written for you. You just need to write the one line of code to read in each stock inside the body of the loop.


In [24]:
# todo: read each stock file inside the loop and combine the results
stocks = ["MSFT", "GOOG", "AAPL"]
combined = pd.DataFrame()
for stock in stocks:
    filename = f"stocks/{stock}.csv"
    # todo read the filename into `stocks_df`
    stocks_df = pd.read_csv(filename)
    combined = pd.concat([combined, stocks_df], ignore_index=True)
combined

Unnamed: 0,Date,Symbol,Open,High,Low,Close,Volume
0,2024-03-18,MSFT,414.25,420.730011,413.779999,417.320007,20106000
1,2024-03-19,MSFT,417.829987,421.670013,415.549988,421.410004,19837900
2,2024-03-20,MSFT,422.0,425.959991,420.660004,425.230011,17860100
3,2024-03-21,MSFT,429.829987,430.820007,427.160004,429.369995,21296200
4,2024-03-22,MSFT,429.700012,429.859985,426.070007,428.73999,17636500
5,2024-03-18,GOOG,149.369995,152.929993,148.139999,148.479996,47676700
6,2024-03-19,GOOG,148.979996,149.619995,147.009995,147.919998,17748400
7,2024-03-20,GOOG,148.789993,149.759995,147.664993,149.679993,17730000
8,2024-03-21,GOOG,150.320007,151.304993,148.009995,148.740005,19843900
9,2024-03-22,GOOG,150.240005,152.559998,150.089996,151.770004,19226300
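
Calling `concat()` inside the loop copies the growing DataFrame on every pass. A common alternative is to collect the pieces in a list and call `concat()` once at the end. The sketch below assumes in-memory CSV text (via `io.StringIO`) in place of the `stocks/*.csv` files so that it runs anywhere; the two rows per symbol are copied from the outputs above, with only some of the columns.

```python
import io
import pandas as pd

# In-memory stand-ins for stocks/MSFT.csv and stocks/GOOG.csv
csv_text = {
    "MSFT": "Date,Symbol,Open,Close\n"
            "2024-03-18,MSFT,414.25,417.320007\n"
            "2024-03-19,MSFT,417.829987,421.410004\n",
    "GOOG": "Date,Symbol,Open,Close\n"
            "2024-03-18,GOOG,149.369995,148.479996\n"
            "2024-03-19,GOOG,148.979996,147.919998\n",
}

frames = []
for stock, text in csv_text.items():
    # With real files this would be pd.read_csv(f"stocks/{stock}.csv")
    frames.append(pd.read_csv(io.StringIO(text)))

# One concat() call over the whole list
combined_demo = pd.concat(frames, ignore_index=True)
print(combined_demo)
```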


## Lambdas and apply()

The Pandas `apply()` method allows us to write a user-defined function and then invoke that function for every row in the dataframe.

This is useful when you need to implement complex transformational logic on your dataframes.

### Example

Let's pretend there is a tax rate applied according to `info.state`, based on the following table:

    - NY = 0.15
    - WA = 0.10
    - CA = 0.20
    - TX = 0.05
    - Everyone else = 0.0

We've seen before you can write this as a function:

In [25]:
def taxrate(state: str) -> float:
    state = state.upper()
    if state == "NY":
        rate = 0.15
    elif state == "WA":
        rate = 0.1 
    elif state == 'CA':
        rate = 0.2
    elif state == 'TX':
        rate = 0.05
    else:
        rate = 0.0
    return rate

# simple test
assert taxrate("TX") == 0.05

With the function created we can now use `apply()` to calculate a `"tax"` column:

In [26]:
companies["tax"] = companies.apply(lambda row: taxrate(row["info.state"]), axis=1)
companies.head()

Unnamed: 0,symbol,name,exchange,industry,sector,info.website,info.city,info.state,info.country,tax
0,X,United States Steel Corporation,NYQ,Steel,Basic Materials,https://www.ussteel.com,Pittsburgh,PA,United States,0.0
1,GM,General Motors Company,NYQ,Auto Manufacturers,Consumer Cyclical,https://www.gm.com,Detroit,MI,United States,0.0
2,AAPL,Apple Inc.,NMS,Consumer Electronics,Technology,https://www.apple.com,Cupertino,CA,United States,0.2
3,AMZN,"Amazon.com, Inc.",NMS,Internet Retail,Consumer Cyclical,https://www.aboutamazon.com,Seattle,WA,United States,0.1
4,META,"Meta Platforms, Inc.",NMS,Internet Content & Information,Communication Services,https://investor.fb.com,Menlo Park,CA,United States,0.2
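
Because this particular rule depends on a single column, a dictionary lookup with `Series.map()` is another way to get the same result. This is a sketch on a stand-in DataFrame; the `rates` dictionary simply restates the table above.

```python
import pandas as pd

rates = {"NY": 0.15, "WA": 0.10, "CA": 0.20, "TX": 0.05}

demo = pd.DataFrame({"symbol": ["X", "AAPL", "AMZN"],
                     "info.state": ["PA", "CA", "WA"]})

# States missing from the dictionary map to NaN, so fill them with 0.0
demo["tax"] = demo["info.state"].str.upper().map(rates).fillna(0.0)
print(demo)
```

`apply()` with a function remains the more flexible choice when the logic needs several columns or branching that a lookup table cannot express.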


#### NOTE!!!

For more details on `lambda/apply` check the assigned reading!

In [27]:
def change(open: float, close: float) -> float:
    return close - open
assert change(1.5, 1.25) == -0.25


### 1.5 You Code 

Using the function `change()` as defined in the cell above, add a column to the `combined` dataframe from 1.4 called `"change"` which calculates the change in the stock for each row. `display()` the output.

In [28]:
# todo write code here
import pandas as pd
import numpy as np

def change(open: float, close: float) -> float:
    return close - open

stocks = ["MSFT", "GOOG", "AAPL"]

combined = pd.DataFrame()

for stock in stocks:
    filename = f"stocks/{stock}.csv"
    stocks_df = pd.read_csv(filename)
    combined = pd.concat([combined, stocks_df], ignore_index=True)

combined['change'] = combined.apply(lambda row: change(row['Open'], row['Close']), axis=1)
display(combined)


Unnamed: 0,Date,Symbol,Open,High,Low,Close,Volume,change
0,2024-03-18,MSFT,414.25,420.730011,413.779999,417.320007,20106000,3.070007
1,2024-03-19,MSFT,417.829987,421.670013,415.549988,421.410004,19837900,3.580017
2,2024-03-20,MSFT,422.0,425.959991,420.660004,425.230011,17860100,3.230011
3,2024-03-21,MSFT,429.829987,430.820007,427.160004,429.369995,21296200,-0.459991
4,2024-03-22,MSFT,429.700012,429.859985,426.070007,428.73999,17636500,-0.960022
5,2024-03-18,GOOG,149.369995,152.929993,148.139999,148.479996,47676700,-0.889999
6,2024-03-19,GOOG,148.979996,149.619995,147.009995,147.919998,17748400,-1.059998
7,2024-03-20,GOOG,148.789993,149.759995,147.664993,149.679993,17730000,0.889999
8,2024-03-21,GOOG,150.320007,151.304993,148.009995,148.740005,19843900,-1.580002
9,2024-03-22,GOOG,150.240005,152.559998,150.089996,151.770004,19226300,1.529999
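
For simple arithmetic like this, subtracting the two columns directly gives the same values as `apply()` without calling a Python function for each row. A sketch on three rows copied from the output above (the `change_apply` column name is only for comparison):

```python
import pandas as pd

demo = pd.DataFrame({
    "Symbol": ["MSFT", "MSFT", "GOOG"],
    "Open": [414.25, 429.829987, 149.369995],
    "Close": [417.320007, 429.369995, 148.479996],
})

def change(open: float, close: float) -> float:
    return close - open

# Row by row with apply(), as in the exercise
demo["change_apply"] = demo.apply(
    lambda row: change(row["Open"], row["Close"]), axis=1)

# Column arithmetic: pandas subtracts element by element
demo["change"] = demo["Close"] - demo["Open"]
print(demo)
```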


# Metacognition


### Rate your comfort level with this week's material so far.   

**1** ==> I don't understand this at all yet and need extra help. If you choose this please try to articulate that which you do not understand to the best of your ability in the questions and comments section below.  
**2** ==> I can do this with help or guidance from other people or resources. If you choose this level, please indicate HOW this person helped you in the questions and comments section below.   
**3** ==> I can do this on my own without any help.   
**4** ==> I can do this on my own and can explain/teach how to do it to others.

`ENTER A NUMBER 1-4 IN THE CELL BELOW`

1

###  Questions And Comments 

Record any questions or comments you have about this lab that you would like to discuss in your recitation. It is expected you will have questions if you complete this assignment. Learning how to articulate what you do not understand is an important skill of critical thinking. Write your questions below so that you remember to ask them in your recitation. We expect you will take responsibility for your learning and ask questions in class.

`ENTER YOUR QUESTIONS/COMMENTS IN THE CELL BELOW`  


I chose 1 because I'm really struggling with the concept of merging DataFrames and how it works. I don’t understand how to merge two DataFrames on a common column or what the different join types (inner, left, right) actually do. It's confusing to figure out how these joins affect the resulting DataFrame and what kind of data is included or excluded. Additionally, I'm not sure if the merging process works the same way as it would in SQL. In SQL, I know how joins work, but I'm not sure how similar it is in pandas. What are the bigger conceptual differences between merging DataFrames in pandas and performing joins in SQL? I could really use some help or examples to better understand these merging concepts and how they function in the system.

## Turn it In

FIRST AND FOREMOST: **Save Your work!** Yes, it auto-saves, but you should get in the habit of saving before submitting. From the menu, choose File --> Save Notebook. Or you can use the shortcut keys `CTRL+S`

### First: Lab Check

Check your lab before submitting. Look for errors and incomplete parts which might cost you a better grade

In [29]:
from casstools.notebook_tools import NotebookFile
NotebookFile().check_lab()

✅ The lab submission appears to have no issues.
  1.0 Percent of cell executed.
  Summary of code Exercises
  CODE	SYNTAX	SIMILARITY
  1.1	ok	0.5555555555555556
  1.2	ok	1.0
  1.3	ok	1.0
  1.4	ok	0
  1.5	ok	1.0


### Second: Lab Submission

Run this code and follow the instructions to turn in your lab. 

In [30]:
from casstools.assignment import Assignment
Assignment().submit()

✅ TIMESTAMP  : 2024-07-24 18:10
✅ COURSE     : ist256
✅ TERM       : summer2024
✅ USER       : dlnosky@syr.edu
✅ STUDENT    : True
💣 ERROR GETTING ASSIGNMENT INFORMATION 💣
❌ Error Details: labsampleanswerset.ipynb is not an assignment on the course assignment list.
Possible Causes:
 - Is the assignment 'labsampleanswerset.ipynb'on the assignment list?
