# Case Study

## Part 1
*To be completed at the conclusion of Day 1*

For the following exercises, use the data stored at `../data/companies.csv`.
You aren't expected to finish all the exercises; just get through as many as time allows and we will review them together.

1. Start by becoming familiar with the data. How many rows and how many columns does it have? What are the data types of the columns?

In [1]:
import pandas as pd
comp_df = pd.read_csv('../data/companies.csv')
comp_df.head()

Unnamed: 0,Symbol,Name,Sector
0,MMM,3M Company,Industrials
1,AOS,A.O. Smith Corp,Industrials
2,ABT,Abbott Laboratories,Health Care
3,ABBV,AbbVie Inc.,Health Care
4,ACN,Accenture plc,Information Technology


In [2]:
comp_df.shape

(505, 3)

In [3]:
comp_df.columns

Index(['Symbol', 'Name', 'Sector'], dtype='object')

In [4]:
comp_df.dtypes

Symbol    object
Name      object
Sector    object
dtype: object

2. Set the data's index to be the "Symbol" column.

In [5]:
#set_index returns a new data frame; the index of comp_df itself doesn't change
#the new data frame has Symbol as its index
#the new data frame may reuse the original's name, if you want
comp_sym_df = comp_df.set_index('Symbol')
comp_sym_df.head()

Symbol,Name,Sector
MMM,3M Company,Industrials
AOS,A.O. Smith Corp,Industrials
ABT,Abbott Laboratories,Health Care
ABBV,AbbVie Inc.,Health Care
ACN,Accenture plc,Information Technology


3. Look up the company with the symbol NCLH. What company is this? What sector is it in?

In [6]:
#filter the original data frame using a condition
#returns a boolean value for each row
comp_df_filter = comp_df['Symbol'] == 'NCLH'
comp_df[comp_df_filter].head()

Unnamed: 0,Symbol,Name,Sector
343,NCLH,Norwegian Cruise Line,Consumer Discretionary


In [20]:
#or look up the label in the Symbol index of the new data frame
#a single label returns a Series holding just that row's values
comp_sym_df.loc['NCLH']

Name       Norwegian Cruise Line
Sector    Consumer Discretionary
Name: NCLH, dtype: object

In [7]:
#or pass a list of labels to the Symbol index of the new data frame
#a list returns a data frame, even when it holds only one label
comp_sym_df.loc[['NCLH']]

Symbol,Name,Sector
NCLH,Norwegian Cruise Line,Consumer Discretionary
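
The difference between the two `.loc` lookups above can be checked directly. The following is a minimal sketch on a two-row stand-in for the companies data (the `demo` frame is an assumption built from rows shown in the outputs, not the real file):

```python
import pandas as pd

# Hypothetical two-row stand-in for the companies data, indexed by Symbol.
demo = pd.DataFrame(
    {
        "Symbol": ["MMM", "NCLH"],
        "Name": ["3M Company", "Norwegian Cruise Line"],
        "Sector": ["Industrials", "Consumer Discretionary"],
    }
).set_index("Symbol")

row_as_series = demo.loc["NCLH"]    # single label: one row comes back as a Series
row_as_frame = demo.loc[["NCLH"]]   # list of labels: a one-row DataFrame

print(type(row_as_series).__name__)
print(type(row_as_frame).__name__)
print(row_as_series["Sector"])
```

The Series form is convenient for reading one value, while the DataFrame form keeps the table shape for further filtering or joining.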


4. Filter down to companies that are in *either* the "Consumer Discretionary" or the "Consumer Staples" sector.

In [8]:
sect_1 = comp_df['Sector'] == 'Consumer Discretionary'
sect_2 = comp_df['Sector'] == 'Consumer Staples'
sect_1or2 = sect_1 | sect_2
sect_1or2

0      False
1      False
2      False
3      False
4      False
       ...  
500    False
501     True
502    False
503    False
504    False
Name: Sector, Length: 505, dtype: bool

5. How many companies are left in the data now?

In [9]:
sum(sect_1or2)

116

In [10]:
comp_df[sect_1or2]

Unnamed: 0,Symbol,Name,Sector
8,AAP,Advance Auto Parts,Consumer Discretionary
29,MO,Altria Group Inc,Consumer Staples
30,AMZN,Amazon.com Inc.,Consumer Discretionary
53,APTV,Aptiv Plc,Consumer Discretionary
54,ADM,Archer-Daniels-Midland Co,Consumer Staples
...,...,...,...
481,WBA,Walgreens Boots Alliance,Consumer Staples
491,WHR,Whirlpool Corp.,Consumer Discretionary
494,WYN,Wyndham Worldwide,Consumer Discretionary
495,WYNN,Wynn Resorts Ltd,Consumer Discretionary
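
As an alternative to combining two boolean conditions with `|`, the same filter can probably be written with `Series.isin`. This is a sketch on a small stand-in frame (`demo` and its four rows are assumptions for illustration):

```python
import pandas as pd

# Hypothetical four-row stand-in for the companies data.
demo = pd.DataFrame(
    {
        "Symbol": ["MMM", "AAP", "MO", "ABT"],
        "Sector": [
            "Industrials",
            "Consumer Discretionary",
            "Consumer Staples",
            "Health Care",
        ],
    }
)

consumer_sectors = ["Consumer Discretionary", "Consumer Staples"]

# True for every row whose Sector is one of the listed values.
in_consumer = demo["Sector"].isin(consumer_sectors)

consumer_df = demo[in_consumer]
n_consumer = int(in_consumer.sum())  # True counts as 1, so the sum is the row count
print(n_consumer)
```

`isin` scales more comfortably than chained `|` conditions when the list of sectors grows.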


6. Create a new column, "Symbol_Length", that is the length of the symbol of each company. *Hint: you may need to reset an index along the way.*

In [11]:
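#comp_df still has its default index, so this reset only adds an 'index' column
#a reset is needed only when Symbol has been set as the index of the same data frame
comp_df = comp_df.reset_index()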
comp_df.head()

Unnamed: 0,index,Symbol,Name,Sector
0,0,MMM,3M Company,Industrials
1,1,AOS,A.O. Smith Corp,Industrials
2,2,ABT,Abbott Laboratories,Health Care
3,3,ABBV,AbbVie Inc.,Health Care
4,4,ACN,Accenture plc,Information Technology


In [12]:
symbol_len = comp_df['Symbol'].str.len()
symbol_len

0      3
1      3
2      3
3      4
4      3
      ..
500    3
501    3
502    3
503    4
504    3
Name: Symbol, Length: 505, dtype: int64

In [13]:
comp_df['Symbol_Length'] = symbol_len
comp_df.head()

Unnamed: 0,index,Symbol,Name,Sector,Symbol_Length
0,0,MMM,3M Company,Industrials,3
1,1,AOS,A.O. Smith Corp,Industrials,3
2,2,ABT,Abbott Laboratories,Health Care,3
3,3,ABBV,AbbVie Inc.,Health Care,4
4,4,ACN,Accenture plc,Information Technology,3
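
The hint about resetting an index applies when "Symbol" is the index rather than a column. A minimal sketch of both routes, using a small stand-in frame (`indexed` and its rows are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in with Symbol as the index, as in comp_sym_df.
indexed = pd.DataFrame(
    {
        "Symbol": ["MMM", "ABBV", "KR"],
        "Name": ["3M Company", "AbbVie Inc.", "Kroger Co."],
    }
).set_index("Symbol")

# Route 1: move Symbol back to a column, then use the .str accessor on it.
via_reset = indexed.reset_index()
via_reset["Symbol_Length"] = via_reset["Symbol"].str.len()

# Route 2: a string Index has a .str accessor too, so no reset is required.
via_index = indexed.copy()
via_index["Symbol_Length"] = via_index.index.str.len()

print(via_reset)
print(via_index)
```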


7. Find the company named "Kroger Co.". Change its name to "The Kroger Company".

In [14]:
comp_df.loc[comp_df['Name'] == 'Kroger Co.'].head()

Unnamed: 0,index,Symbol,Name,Sector,Symbol_Length
275,275,KR,Kroger Co.,Consumer Staples,2


In [15]:
comp_df.loc[(comp_df.Name == 'Kroger Co.'),'Name']='The Kroger Company'
comp_df.loc[comp_df['Name'] == 'The Kroger Company'].head()

Unnamed: 0,index,Symbol,Name,Sector,Symbol_Length
275,275,KR,The Kroger Company,Consumer Staples,2


In [22]:
#the same filter works without .loc
#per the instructor, .loc is the preferred syntax,
#especially when assigning values as in the cell above
comp_df[comp_df['Name'] == 'The Kroger Company'].head()

Unnamed: 0,index,Symbol,Name,Sector,Symbol_Length
275,275,KR,The Kroger Company,Consumer Staples,2


In [16]:
#lines above work fine
#reload the original data to try another approach
comp_df_2 = pd.read_csv('../data/companies.csv')

comp_df_kroger = comp_df_2['Name'] == 'Kroger Co.'
comp_df_2[comp_df_kroger].head()

Unnamed: 0,Symbol,Name,Sector
275,KR,Kroger Co.,Consumer Staples


In [17]:
#numpy approach works too!
import numpy as np
comp_df_2['Name'] = np.where((comp_df_2.Name == 'Kroger Co.'),'The Kroger Company',comp_df_2.Name)
comp_df_2[comp_df_kroger].head()

Unnamed: 0,Symbol,Name,Sector
275,KR,The Kroger Company,Consumer Staples


**Bonus**: *For these two exercises, you won't find examples of the solution in our notebooks.
You'll need to search for help on the internet.*

*Don't worry if you aren't able to solve them.*

1. Filter down to companies whose symbol starts with A. How many companies meet this criterion?
2. What is the longest company name remaining in the dataset? You could just search the data visually, but try to find a programmatic solution.

In [18]:
#the exercise asks about the symbol, so match on the Symbol column
filtered = comp_df['Symbol'].str.match(pat = '^A')
filtered

0      False
1       True
2       True
3       True
4       True
       ...  
500    False
501    False
502    False
503    False
504    False
Name: Symbol, Length: 505, dtype: bool

In [19]:
comp_df[filtered].head()

Unnamed: 0,index,Symbol,Name,Sector,Symbol_Length
1,1,AOS,A.O. Smith Corp,Industrials,3
2,2,ABT,Abbott Laboratories,Health Care,3
3,3,ABBV,AbbVie Inc.,Health Care,4
4,4,ACN,Accenture plc,Information Technology,3
5,5,ATVI,Activision Blizzard,Information Technology,4


In [23]:
#instructors showed this approach
#applied to the Symbol column, since the exercise asks about the symbol
filtered2 = comp_df['Symbol'].str.startswith(pat = 'A')
filtered2

0      False
1       True
2       True
3       True
4       True
       ...  
500    False
501    False
502    False
503    False
504    False
Name: Symbol, Length: 505, dtype: bool

In [24]:
comp_df[filtered2].head()

Unnamed: 0,index,Symbol,Name,Sector,Symbol_Length
1,1,AOS,A.O. Smith Corp,Industrials,3
2,2,ABT,Abbott Laboratories,Health Care,3
3,3,ABBV,AbbVie Inc.,Health Care,4
4,4,ACN,Accenture plc,Information Technology,3
5,5,ATVI,Activision Blizzard,Information Technology,4
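
The cells above build the filter but stop short of the count and of the second bonus question. One possible way to finish both is sketched below on a five-row stand-in (`demo` is an assumption taken from the first rows of the output, so the numbers are not the answers for the full file):

```python
import pandas as pd

# Hypothetical five-row stand-in for the companies data.
demo = pd.DataFrame(
    {
        "Symbol": ["MMM", "AOS", "ABT", "ABBV", "ACN"],
        "Name": [
            "3M Company",
            "A.O. Smith Corp",
            "Abbott Laboratories",
            "AbbVie Inc.",
            "Accenture plc",
        ],
    }
)

# Bonus 1: keep rows whose symbol starts with "A" and count them.
starts_with_a = demo["Symbol"].str.startswith("A")
a_companies = demo[starts_with_a]
n_a = int(starts_with_a.sum())

# Bonus 2: among the remaining rows, find the name with the most characters.
name_lengths = a_companies["Name"].str.len()
longest_name = a_companies.loc[name_lengths.idxmax(), "Name"]

print(n_a)
print(longest_name)
```

`idxmax` returns the index label of the first maximum, so ties are resolved in favor of the earliest row.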


## Part 2
*To be completed at the conclusion of Day 2*

This section again uses the data at `../data/companies.csv`.

1. Re-create the "Symbol_Length" column (see above).
2. What is the average symbol length of companies in the data set?
3. What is the average symbol length by sector? That is, after grouping by sector, what is the average symbol length for each group?
4. How long is the longest company name? How long is the longest company name by sector?
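
A rough sketch of how these four questions might be approached follows. It runs on a four-row stand-in rather than the real file, and `Name_Length` is a hypothetical helper column that the exercises do not ask for:

```python
import pandas as pd

# Hypothetical four-row stand-in for the companies data.
demo = pd.DataFrame(
    {
        "Symbol": ["MMM", "AOS", "ABT", "ABBV"],
        "Name": [
            "3M Company",
            "A.O. Smith Corp",
            "Abbott Laboratories",
            "AbbVie Inc.",
        ],
        "Sector": ["Industrials", "Industrials", "Health Care", "Health Care"],
    }
)

# 1. Re-create Symbol_Length (plus a helper column for name lengths).
demo["Symbol_Length"] = demo["Symbol"].str.len()
demo["Name_Length"] = demo["Name"].str.len()

# 2. Average symbol length over all rows.
mean_len = demo["Symbol_Length"].mean()

# 3. Average symbol length within each sector.
mean_len_by_sector = demo.groupby("Sector")["Symbol_Length"].mean()

# 4. Longest name overall, and longest name within each sector.
max_name_len = demo["Name_Length"].max()
max_name_len_by_sector = demo.groupby("Sector")["Name_Length"].max()

print(mean_len)
print(mean_len_by_sector)
print(max_name_len)
print(max_name_len_by_sector)
```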

Now open the pricing data at `../data/prices.csv`.
Note that this data is entirely fabricated and does not exhibit the qualities of real stock market data!

1. Become familiar with this data. What is its shape? What are its data types?
2. Get summary metrics (count, min, max, standard deviation, etc.) for both the Price and Quarter columns. *Hint: we saw a DataFrame method that will do this for you in a single line.*
3. Perform an inner join between this data set and the companies data, on the Symbol column.
4. How many rows does our data have now?
5. What do you think this data represents? Form a hypothesis and look through the data more carefully until you are confident you understand what it is and how it is structured.
6. Group the data by sector. What is the average first quarter price for a company in the Real Estate sector? What is the minimum fourth quarter price for a company in the Industrials sector?
7. Filter the data down to just prices for Apple, Google, Microsoft, and Amazon.
8. Save this data as big_4.csv in the `../data` directory.
9. Using Seaborn, plot the price of these companies over 4 quarters. Encode the quarter as the x-axis, the price as the y-axis, and the company symbol as the hue.
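
The sketch below walks through the summary, join, grouping, and filtering steps on invented stand-in frames. It assumes the prices file has `Symbol`, `Quarter`, and `Price` columns with quarters stored as integers 1 to 4; most prices here are made up, `ZZZZ` is a fake symbol, and the tickers in `big_4_symbols` are assumptions about how the four companies appear in the data:

```python
import pandas as pd

# Hypothetical stand-ins for companies.csv and prices.csv.
companies = pd.DataFrame(
    {
        "Symbol": ["AAPL", "MSFT", "MMM"],
        "Name": ["Apple Inc.", "Microsoft Corp.", "3M Company"],
        "Sector": ["Information Technology", "Information Technology", "Industrials"],
    }
)
prices = pd.DataFrame(
    {
        "Symbol": ["AAPL", "AAPL", "MSFT", "MSFT", "MMM", "MMM", "ZZZZ"],
        "Quarter": [1, 4, 1, 4, 1, 4, 1],
        "Price": [275.20, 266.07, 150.00, 160.00, 100.00, 90.00, 5.00],
    }
)

# Summary metrics for the two numeric columns in one call.
summary = prices[["Price", "Quarter"]].describe()

# Inner join: rows whose Symbol is missing from either side are dropped.
merged = prices.merge(companies, on="Symbol", how="inner")

# Mean and minimum price for every (Sector, Quarter) pair.
sector_quarter = merged.groupby(["Sector", "Quarter"])["Price"].agg(["mean", "min"])
q1_mean_it = sector_quarter.loc[("Information Technology", 1), "mean"]
q4_min_ind = sector_quarter.loc[("Industrials", 4), "min"]

# Keep only the four large companies (assumed tickers).
big_4_symbols = ["AAPL", "GOOGL", "MSFT", "AMZN"]
big_4 = merged[merged["Symbol"].isin(big_4_symbols)]

# Saving is left commented out so the sketch runs without the ../data directory:
# big_4.to_csv("../data/big_4.csv", index=False)

print(summary)
print(len(merged))
print(big_4)
```

For the plot, a call along the lines of `sns.lineplot(data=big_4, x="Quarter", y="Price", hue="Symbol")` (with `import seaborn as sns`) maps the quarter to the x-axis, the price to the y-axis, and the symbol to the hue.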

**Bonus**:

This data is in a form that is useful for plotting.
But in this shape, it would be quite difficult to calculate the difference between each company's fourth quarter price and its first quarter price.

Reshape this data so it is of a form like the below:

| Symbol | Name | Sector | Q1 | Q2 | Q3 | Q4 |
|--------|------|--------|----|----|----|----|
| AAPL   | Apple Inc. | Information Technology | 275.20 | 269.96 | 263.51 | 266.07 |

From which we could easily calculate Q4 - Q1.

*You will probably want to google something like "python reshaping data". This is a very challenging problem!*
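
One possible approach to the reshape, sketched under assumptions: the joined data is taken to have `Symbol`, `Name`, `Sector`, `Quarter`, and `Price` columns with quarters stored as integers 1 to 4, and `Q4_minus_Q1` is a hypothetical column name. The single company below uses the prices from the example table above.

```python
import pandas as pd

# Hypothetical long-form data: one row per company per quarter.
long_df = pd.DataFrame(
    {
        "Symbol": ["AAPL"] * 4,
        "Name": ["Apple Inc."] * 4,
        "Sector": ["Information Technology"] * 4,
        "Quarter": [1, 2, 3, 4],
        "Price": [275.20, 269.96, 263.51, 266.07],
    }
)

# Put the identifying columns and Quarter in the index, then unstack Quarter
# so that each quarter becomes its own column.
wide = (
    long_df.set_index(["Symbol", "Name", "Sector", "Quarter"])["Price"]
    .unstack("Quarter")
    .rename(columns=lambda q: f"Q{q}")
    .reset_index()
)
wide.columns.name = None  # drop the leftover "Quarter" label on the column axis

# With one row per company, the difference is a plain column subtraction.
wide["Q4_minus_Q1"] = wide["Q4"] - wide["Q1"]

print(wide)
```

`unstack` raises an error if a company has two rows for the same quarter, which is a useful check; `pivot_table` would instead aggregate such duplicates silently.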