
# Hands-on Activity 8.1: Aggregating Data with Pandas

### 8.1.1 Intended Learning Outcomes
After this activity, the student should be able to:
- Demonstrate querying and merging of dataframes
- Perform advanced calculations on dataframes
- Aggregate dataframes with pandas and numpy
- Work with time series data

### 8.1.2 Resources
- Computing Environment using Python 3.x
- Attached Datasets (under Instructional Materials)

### 8.1.3 Procedures
The procedures can be found in the Canvas module. Check the following topics:
- Weather Data Collection
- Querying and Merging
- Dataframe Operations
- Aggregations
- Time Series

### 8.1.4 Data Analysis
Provide some comments here about the results of the procedures.

### 8.1.5 Supplementary Activity
Using the CSV files provided and what we have learned so far in this module, complete the following exercises:

1. With the earthquakes.csv file, select all the earthquakes in Japan with a magType of mb and a magnitude of 4.9 or greater.
2. Create bins for each full number of magnitude (for example, the first bin is 0-1, the second is 1-2, and so on) with a magType of ml and count how many are in each bin.
3. Using the faang.csv file, group by the ticker and resample to monthly frequency. Make the following aggregations:
   - Mean of the opening price
   - Maximum of the high price
   - Minimum of the low price
   - Mean of the closing price
   - Sum of the volume traded
4. Build a crosstab with the earthquake data between the tsunami column and the magType column. Rather than showing the frequency count, show the maximum magnitude that was observed for each combination. Put the magType along the columns.
5. Calculate the rolling 60-day aggregations of OHLC data by ticker for the FAANG data. Use the same aggregations as exercise no. 3.
6. Create a pivot table of the FAANG data that compares the stocks. Put the ticker in the rows and show the averages of the OHLC and volume traded data.
7. Calculate the Z-scores for each numeric column of Netflix's data (ticker is NFLX) using apply().
8. Add event descriptions:
   - Create a dataframe with the following three columns: ticker, date, and event. The columns should have the following values:
     - ticker: 'FB'
     - date: ['2018-07-25', '2018-03-19', '2018-03-20']
     - event: ['Disappointing user growth announced after close.', 'Cambridge Analytica story', 'FTC investigation']
   - Set the index to ['date', 'ticker']
   - Merge this data with the FAANG data using an outer join.
9. Use the transform() method on the FAANG data to represent all the values in terms of the first date in the data. To do so, divide all the values for each ticker by the values for the first date in the data for that ticker. This is referred to as an index, and the data for the first date is the base. When data is in this format, we can easily see growth over time. Hint: transform() can take a function name.

[Reference on index and base year (Eurostat)](https://ec.europa.eu/eurostat/statistics-explained/index.php/Beginners:Statisticalconcept-Indexandbaseyear)
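
The indexing idea in exercise 9 can be sketched on a tiny frame before applying it to the real data. The tickers `AAA` and `BBB` and all prices below are made up for illustration and are not rows from faang.csv; the sketch assumes the rows of each ticker are already in date order.

```python
import pandas as pd

# Hypothetical miniature of the FAANG data (tickers and prices are made up)
prices = pd.DataFrame(
    {
        'ticker': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
        'date': pd.to_datetime(['2018-01-02', '2018-01-03', '2018-01-04'] * 2),
        'close': [100.0, 110.0, 121.0, 50.0, 45.0, 60.0],
    }
).set_index('date')

# Passing the function name 'first' to transform() broadcasts each ticker's
# first value back onto every row of that ticker, so the division is row-aligned.
base = prices.groupby('ticker')['close'].transform('first')
indexed = prices.assign(close_index=prices['close'] / base)
print(indexed)
```

Every ticker starts at 1.0 on its base date, so later values read directly as growth relative to that date.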


# With the earthquakes.csv file, select all the earthquakes in Japan with a magType of mb and a magnitude of 4.9 or greater.

In [None]:
import pandas as pd

earthquakes = pd.read_csv('earthquakes.csv')
earthquakes

Unnamed: 0,mag,magType,time,place,tsunami,parsed_place
0,1.35,ml,1539475168010,"9km NE of Aguanga, CA",0,California
1,1.29,ml,1539475129610,"9km NE of Aguanga, CA",0,California
2,3.42,ml,1539475062610,"8km NE of Aguanga, CA",0,California
3,0.44,ml,1539474978070,"9km NE of Aguanga, CA",0,California
4,2.16,md,1539474716050,"10km NW of Avenal, CA",0,California
...,...,...,...,...,...,...
9327,0.62,md,1537230228060,"9km ENE of Mammoth Lakes, CA",0,California
9328,1.00,ml,1537230135130,"3km W of Julian, CA",0,California
9329,2.40,md,1537229908180,"35km NNE of Hatillo, Puerto Rico",0,Puerto Rico
9330,1.10,ml,1537229545350,"9km NE of Aguanga, CA",0,California


In [None]:
japan_earthquakes = earthquakes[earthquakes['place'].str.contains('Japan', na=False)]
selected = japan_earthquakes[(japan_earthquakes['magType'] == 'mb') & (japan_earthquakes['mag'] >= 4.9)]
print(selected)

      mag magType           time                         place  tsunami  \
1563  4.9      mb  1538977532250  293km ESE of Iwo Jima, Japan        0   
2576  5.4      mb  1538697528010    37km E of Tomakomai, Japan        0   
3072  4.9      mb  1538579732490     15km ENE of Hasaki, Japan        0   
3632  4.9      mb  1538450871260    53km ESE of Hitachi, Japan        0   

     parsed_place  
1563        Japan  
2576        Japan  
3072        Japan  
3632        Japan  
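
The same three conditions can also be written as a single `query()` call. The sketch below is a possible alternative, not the required approach; it filters on the `parsed_place` column seen in the output above, and the sample rows (including the two invented place names) are hypothetical stand-ins for earthquakes.csv.

```python
import pandas as pd

# Hypothetical sample rows that mimic the columns of earthquakes.csv
sample = pd.DataFrame(
    {
        'mag': [4.9, 5.4, 4.8, 5.0, 1.3],
        'magType': ['mb', 'mb', 'mb', 'mww', 'ml'],
        'place': [
            '293km ESE of Iwo Jima, Japan',
            '37km E of Tomakomai, Japan',
            '10km N of Somewhere, Japan',   # made-up place
            '5km S of Elsewhere, Japan',    # made-up place
            '9km NE of Aguanga, CA',
        ],
        'parsed_place': ['Japan', 'Japan', 'Japan', 'Japan', 'California'],
    }
)

# All three filters in one expression; 'and' combines the row-wise conditions
selected_sample = sample.query(
    "parsed_place == 'Japan' and magType == 'mb' and mag >= 4.9"
)
print(selected_sample)
```

Matching on `parsed_place` avoids picking up rows whose free-text `place` merely mentions Japan, assuming that column was parsed reliably.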


# Create bins for each full number of magnitude (for example, the first bin is 0-1, the second is 1-2, and so on) with a magType of ml and count how many are in each bin.

In [None]:
# Earthquakes with magType of ml
ml_earthquakes = earthquakes[earthquakes['magType'] == 'ml']

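# Create bins for each full number of magnitude.
# Note: int() truncates toward zero, so a negative non-integer minimum
# (for example, -1.5) would fall below the first edge and be left unbinned.
mag_min = int(ml_earthquakes['mag'].min())
mag_max = int(ml_earthquakes['mag'].max())
bins = pd.cut(ml_earthquakes['mag'], bins=range(mag_min, mag_max + 2), right=False)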

# Count in each bin and sort the index
bin_counts = bins.value_counts().sort_index()

# Create a DataFrame to hold bin counts and specify column names
bin_counts_df = pd.DataFrame({'Magnitude Range': [f"{bin_range.left}-{bin_range.right}" for bin_range in bin_counts.index],
                              'Frequency': bin_counts.values})

print(bin_counts_df)

  Magnitude Range  Frequency
0            -1-0        446
1             0-1       2072
2             1-2       3126
3             2-3        985
4             3-4        153
5             4-5          6
6             5-6          2
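
Because some ml magnitudes are negative, the choice of the lowest bin edge matters. The sketch below shows one way to derive the edges with `math.floor`, which rounds down rather than toward zero; the magnitudes are made-up values, not taken from earthquakes.csv.

```python
import math
import pandas as pd

# Hypothetical magnitudes, including a negative non-integer minimum
mags = pd.Series([-1.26, -0.4, 0.0, 0.7, 1.0, 1.9, 2.5])

# math.floor rounds down (-1.26 -> -2); int() truncates toward zero (-1.26 -> -1)
low = math.floor(mags.min())
high = math.floor(mags.max()) + 1  # the last right edge must exceed the maximum

edges = list(range(low, high + 1))
binned = pd.cut(mags, bins=edges, right=False)  # intervals are [left, right)
counts = binned.value_counts().sort_index()
print(counts)
```

With these values, edges built from `int(mags.min())` would start at -1 and leave -1.26 unbinned, while the floor-based edges cover every observation.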


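# Using the faang.csv file, group by the ticker and resample to monthly frequency. Make the following aggregations:

- Mean of the opening price
- Maximum of the high price
- Minimum of the low price
- Mean of the closing price
- Sum of the volume traded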

In [57]:
import pandas as pd

faang = pd.read_csv('faang.csv')
faang

Unnamed: 0,ticker,date,open,high,low,close,volume
0,FB,2018-01-02,177.68,181.58,177.5500,181.42,18151903
1,FB,2018-01-03,181.88,184.78,181.3300,184.67,16886563
2,FB,2018-01-04,184.90,186.21,184.0996,184.33,13880896
3,FB,2018-01-05,185.59,186.90,184.9300,186.85,13574535
4,FB,2018-01-08,187.20,188.90,186.3300,188.28,17994726
...,...,...,...,...,...,...,...
1250,GOOG,2018-12-24,973.90,1003.54,970.1100,976.22,1590328
1251,GOOG,2018-12-26,989.01,1040.00,983.0000,1039.46,2373270
1252,GOOG,2018-12-27,1017.15,1043.89,997.0000,1043.88,2109777
1253,GOOG,2018-12-28,1049.62,1055.56,1033.1000,1037.08,1413772


In [75]:
import pandas as pd

# Load the FAANG data from the CSV file
faang = pd.read_csv('faang.csv')

# Convert the 'date' column to datetime format
faang['date'] = pd.to_datetime(faang['date'])

# Set the 'date' column as the index
faang.set_index('date', inplace=True)

# Group by the 'ticker' column and resample to monthly frequency
# ('M' labels each group by its month end; pandas 2.2+ deprecates this alias in favor of 'ME')
monthly = faang.groupby('ticker').resample('M')

# Define aggregations
aggregations = {
    'open': 'mean',
    'high': 'max',
    'low': 'min',
    'close': 'mean',
    'volume': 'sum'
}

# Apply aggregations
monthly_aggregated = monthly.agg(aggregations)

print("Monthly Aggregations:")
print(monthly_aggregated)


Monthly Aggregations:
                          open       high        low        close     volume
ticker date                                                                 
AAPL   2018-01-31   170.714690   176.6782   161.5708   170.699271  659679440
       2018-02-28   164.562753   177.9059   147.9865   164.921884  927894473
       2018-03-31   172.421381   180.7477   162.4660   171.878919  713727447
       2018-04-30   167.332895   176.2526   158.2207   167.286924  666360147
       2018-05-31   182.635582   187.9311   162.7911   183.207418  620976206
       2018-06-30   186.605843   192.0247   178.7056   186.508652  527624365
       2018-07-31   188.065786   193.7650   181.3655   188.179724  393843881
       2018-08-31   210.460287   227.1001   195.0999   211.477743  700318837
       2018-09-30   220.611742   227.8939   213.6351   220.356353  678972040
       2018-10-31   219.489426   231.6645   204.4963   219.137822  789748068
       2018-11-30   190.828681   220.6405   169.5328  
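
As a cross-check on the monthly aggregation, the same grouping can be sketched with calendar-month periods instead of `resample()`. This is an alternative technique, not the one the exercise asks for; the tickers `AAA` and `BBB` and all values below are made up for illustration.

```python
import pandas as pd

# Hypothetical daily OHLCV rows for two made-up tickers
daily = pd.DataFrame(
    {
        'ticker': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
        'open': [10.0, 12.0, 20.0, 5.0, 7.0, 9.0],
        'high': [11.0, 13.0, 21.0, 6.0, 8.0, 10.0],
        'low': [9.0, 11.0, 19.0, 4.0, 6.0, 8.0],
        'close': [10.5, 12.5, 20.5, 5.5, 7.5, 9.5],
        'volume': [100, 200, 300, 10, 20, 30],
    },
    index=pd.to_datetime(['2018-01-02', '2018-01-03', '2018-02-01'] * 2).rename('date'),
)

# Group by ticker and by the calendar month each date falls in
monthly_sample = daily.groupby(['ticker', daily.index.to_period('M')]).agg(
    {'open': 'mean', 'high': 'max', 'low': 'min', 'close': 'mean', 'volume': 'sum'}
)
print(monthly_sample)
```

The result is indexed by month periods (for example, 2018-01) rather than month-end timestamps, and months with no rows are simply absent instead of appearing as empty groups.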

# Build a crosstab with the earthquake data between the tsunami column and the magType column. Rather than showing the frequency count, show the maximum magnitude that was observed for each combination. Put the magType along the columns.


In [74]:
pivot_table = earthquakes.pivot_table(index='tsunami', columns='magType', values='mag', aggfunc='max')

print(pivot_table)

magType   mb  mb_lg    md   mh   ml  ms_20    mw  mwb  mwr  mww
tsunami                                                        
0        5.6    3.5  4.11  1.1  4.2    NaN  3.83  5.8  4.8  6.0
1        6.1    NaN   NaN  NaN  5.1    5.7  4.41  NaN  NaN  7.5
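
The exercise asks for a crosstab, and `pd.crosstab()` can produce this kind of table directly when given `values` and `aggfunc`. The sketch below uses a small hypothetical sample in place of the earthquake data; on the real data it should give a table equivalent to the pivot table above.

```python
import pandas as pd

# Hypothetical sample standing in for the earthquake data
quakes = pd.DataFrame(
    {
        'tsunami': [0, 0, 1, 1, 0],
        'magType': ['mb', 'ml', 'mb', 'mww', 'mb'],
        'mag': [5.6, 4.2, 6.1, 7.5, 4.9],
    }
)

# values + aggfunc replace the default frequency count with the maximum magnitude
max_mag = pd.crosstab(
    index=quakes['tsunami'],
    columns=quakes['magType'],
    values=quakes['mag'],
    aggfunc='max',
)
print(max_mag)
```

Combinations that never occur in the data show up as NaN, as they do in the pivot table.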


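# Calculate the rolling 60-day aggregations of OHLC data by ticker for the FAANG data. Use the same aggregations as exercise no. 3.

In [81]: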
import pandas as pd

# Load the FAANG data from the CSV file
faang = pd.read_csv('faang.csv')

# Convert the 'date' column to datetime format
faang['date'] = pd.to_datetime(faang['date'])

# Set the 'date' column as the index
faang.set_index('date', inplace=True)

# Define rolling window size
size = '60D'

# Define aggregations
rolling_aggregations = {
    'open': 'mean',
    'high': 'max',
    'low': 'min',
    'close': 'mean',
    'volume': 'sum'
}

# Apply rolling 60-day aggregations by ticker
rolling_aggregated = faang.groupby('ticker').rolling(window=size).agg(rolling_aggregations)

# Reset index for better visualization
rolling_aggregated.reset_index(inplace=True)

# Sort DataFrame by ticker and date
rolling_aggregated.sort_values(by=['ticker', 'date'], inplace=True)

print("Rolling 60-day Aggregations:")
print(rolling_aggregated)


Rolling 60-day Aggregations:
     ticker       date        open      high       low       close  \
0      AAPL 2018-01-02  166.927100  169.0264  166.0442  168.987200   
1      AAPL 2018-01-03  168.089600  171.2337  166.0442  168.972500   
2      AAPL 2018-01-04  168.480367  171.2337  166.0442  169.229200   
3      AAPL 2018-01-05  168.896475  172.0381  166.0442  169.840675   
4      AAPL 2018-01-08  169.324680  172.2736  166.0442  170.080040   
...     ...        ...         ...       ...       ...         ...   
1250   NFLX 2018-12-24  283.509250  332.0499  233.6800  281.931750   
1251   NFLX 2018-12-26  281.844500  332.0499  231.2300  280.777750   
1252   NFLX 2018-12-27  281.070488  332.0499  231.2300  280.162805   
1253   NFLX 2018-12-28  279.916341  332.0499  231.2300  279.461341   
1254   NFLX 2018-12-31  278.430769  332.0499  231.2300  277.451410   

           volume  
0      25555934.0  
1      55073833.0  
2      77508430.0  
3     101168448.0  
4     121736214.0  
...       