In [2]:
import pandas as pd

# Exercise 5.0: Importing the College Baseball Data Sets

The University of Hawai'i is proud of its Rainbow Warrior student athletes! For the exercises in this chapter, we will work with four data sets from the 2017 and 2018 NCAA Division I college baseball seasons. These data sets and more are publicly available on the [NCAA website](https://www.ncaa.com/). The tables below summarize the columns in each data set.


Team batting average: "Data/batting_2018.csv" and "Data/batting_2017.csv"

| Column |Description|
|:----------|-----------|
| `Rank`| Rank of the team based on the team's overall batting average |
| `Name` | Name of the team |
| `G` | Number of games played in the season  |
| `W.L` | Win-loss record in the form wins-losses (some records include a third number, as in `55-12-1`) |
| `AB` | Number of times at bat |
| `H` | Number of hits in total |
| `BA` | The team's batting average |


Team scoring data: "Data/scoring_2018.csv" and "Data/scoring_2017.csv"

| Column |Description|
|:----------|-----------|
| `Rank`| Rank of the team based on the team's points per game |
| `Name` | Name of the team |
| `G` | Number of games played in the season  |
| `W.L` | Win-loss record in the form wins-losses (some records include a third number, as in `55-12-1`) |
| `R` | Total number of runs (i.e. points) scored |
| `PG` | The team's average points per game for the season |

Please run the code cell below before attempting the exercises in the chapter to load the data into your working environment.

In [34]:
batting_2018 = pd.read_csv("Data/batting_2018.csv")
batting_2017 = pd.read_csv("Data/batting_2017.csv")

scoring_2018 = pd.read_csv("Data/scoring_2018.csv")
scoring_2017 = pd.read_csv("Data/scoring_2017.csv")

Unnamed: 0,Rank,Name,G,W.L,AB,H,BA
0,1,Tennessee Tech,65,53-12,2370,788,0.332
1,2,UNC Greensboro,54,39-15,1917,624,0.326
2,3,Oregon St.,68,55-12-1,2345,753,0.321
3,4,Morehead St.,63,37-26,2325,734,0.316
4,5,New Mexico St.,62,40-22,2113,656,0.31
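
As a quick sanity check on the column descriptions, the sketch below recomputes the batting average for the five rows shown above. It assumes `BA` is hits divided by at bats, rounded to three decimal places; the rows are typed in by hand so the example runs without the CSV files, and `BA_check` and `ba_matches` are names introduced only for this illustration.

```python
import pandas as pd

# The five rows from the output above, entered by hand.
sample = pd.DataFrame({
    "Name": ["Tennessee Tech", "UNC Greensboro", "Oregon St.",
             "Morehead St.", "New Mexico St."],
    "AB": [2370, 1917, 2345, 2325, 2113],
    "H": [788, 624, 753, 734, 656],
    "BA": [0.332, 0.326, 0.321, 0.316, 0.310],
})

# Assumption: BA = H / AB, rounded to three decimals.
sample["BA_check"] = (sample["H"] / sample["AB"]).round(3)

# Compare with a small tolerance rather than exact float equality.
ba_matches = bool((sample["BA_check"] - sample["BA"]).abs().lt(1e-9).all())

print(sample[["Name", "BA", "BA_check"]])
print("All reported averages match H / AB:", ba_matches)
```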


# Exercise 5.1: Merging Series

The following code cell builds two `Series`: `team_names_2018`, which contains the names of the teams competing in the 2018 Division I baseball season, and `team_records_2018`, which contains each team's win-loss record. Both `Series` are indexed by the teams' end-of-season rank.

In [23]:
team_names_2018 = scoring_2018.loc[:,'Name']
team_records_2018 = scoring_2018.loc[:,'W.L']

team_names_2018.index = scoring_2018.loc[:,'Rank']
team_records_2018.index = scoring_2018.loc[:,'Rank']

Use the code cell below to merge the two `Series`, `team_names_2018` and `team_records_2018`, into a single `DataFrame` named `team_names_and_records`. The `DataFrame` should be indexed by team rank, with columns labeled `Name` and `W.L` holding the entries of `team_names_2018` and `team_records_2018`, respectively.

In [24]:
# Type your solution to Exercise 5.1 here
team_names_and_records = pd.DataFrame({'Name':team_names_2018, 'W.L':team_records_2018})
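
Building a `DataFrame` from a dictionary of `Series` aligns the values on the index labels, not on their position. The sketch below illustrates this with hypothetical toy data (the team names, records, and variable names are made up for the example) and also shows `pd.concat()` with `keys` as one alternative route to the same table.

```python
import pandas as pd

# Hypothetical toy data, not taken from the NCAA files.
names = pd.Series(["Team A", "Team B", "Team C"], index=[1, 2, 3])
# The second Series is deliberately stored in a different order.
records = pd.Series(["8-3", "10-1", "9-2"], index=[3, 1, 2])

# Route 1: a dict of Series; the dict keys become the column labels
# and rows are matched by index label.
by_dict = pd.DataFrame({"Name": names, "W.L": records})

# Route 2: column-wise concatenation; `keys` supplies the column labels.
by_concat = pd.concat([names, records], axis="columns", keys=["Name", "W.L"])

print(by_dict)
print(by_concat.sort_index())
```

In both results the record `10-1` sits next to `Team A`, because both carry the index label `1`.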

# Exercise 5.2: Merging DataFrames

Observe that the two `DataFrames` `batting_2018` and `batting_2017` share the same column labels, which calls for some care when merging them. A column such as `Rank` would make a poor merge key, since a team's rank can change from one year to the next. Team names, on the other hand, are not expected to change between seasons (it is possible but highly unlikely), which makes `Name` a good key to merge on.

Merge the `DataFrames` `batting_2018` and `batting_2017` using the common `Name` column as the key. Apply the suffixes '_2017' and '_2018' to the overlapping column names. Save the resulting `DataFrame` as `batting_2017_2018_df`. 

*Hint:* You can set the `suffixes` parameter of the `DataFrame` `merge()` method like so: `suffixes=('_2017', '_2018')` to obtain the desired column labels for the new `DataFrame`

In [29]:
# Type your solution to Exercise 5.2 here
batting_2017_2018_df = batting_2017.merge(batting_2018, on='Name', suffixes=('_2017', '_2018'))

Unnamed: 0,Rank_2017,Name,G_2017,W.L_2017,AB_2017,H_2017,BA_2017,Rank_2018,G_2018,W.L_2018,AB_2018,H_2018,BA_2018
0,1,Tennessee Tech,65,53-12,2370,788,0.332,1,65,53-12,2370,788,0.332
1,2,UNC Greensboro,54,39-15,1917,624,0.326,2,54,39-15,1917,624,0.326
2,3,Oregon St.,68,55-12-1,2345,753,0.321,3,68,55-12-1,2345,753,0.321
3,4,Morehead St.,63,37-26,2325,734,0.316,4,63,37-26,2325,734,0.316
4,5,New Mexico St.,62,40-22,2113,656,0.31,5,62,40-22,2113,656,0.31
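
The effect of `suffixes` is easier to see on a small example. The sketch below uses two hypothetical toy tables (invented team names and numbers) and assumes each name appears at most once per table; the optional `validate="one_to_one"` argument asks `merge()` to raise an error if that assumption does not hold.

```python
import pandas as pd

# Hypothetical toy tables standing in for two seasons.
left = pd.DataFrame({
    "Name": ["Team A", "Team B", "Team C"],
    "Rank": [1, 2, 3],
    "BA": [0.310, 0.300, 0.290],
})
right = pd.DataFrame({
    "Name": ["Team B", "Team C", "Team D"],
    "Rank": [1, 2, 3],
    "BA": [0.320, 0.295, 0.280],
})

# The first suffix is applied to overlapping columns from the calling
# (left) DataFrame, the second to those from the DataFrame passed in.
merged = left.merge(
    right,
    on="Name",
    suffixes=("_2017", "_2018"),
    validate="one_to_one",  # fail loudly if a name is duplicated
)

print(merged.columns.tolist())
print(merged)
```

The key column `Name` keeps its label, while every other shared column appears twice, once per suffix.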


# Exercise 5.3: Alternative Merging Strategies

It is common for teams to join and leave Division I college baseball between seasons. Let us see exactly how the set of teams differs between the 2017 and 2018 seasons.

First, let us check the total number of teams competing in each season.

In [26]:
print("2018 number of teams: {}".format(scoring_2018.Name.nunique()))
print("2017 number of teams: {}".format(scoring_2017.Name.nunique()))

2018 number of teams: 275
2017 number of teams: 278


The output of the code cell above shows that the 2017 and 2018 seasons do not have the same number of teams. Let us look at the symmetric difference of the two sets of team names, i.e. the teams that competed in either 2017 or 2018 but not both. To do this we will use the `symmetric_difference()` method of Python's built-in `set` type.

In [27]:
s_2018 = set(scoring_2018.Name)
s_2017 = set(scoring_2017.Name)

sd = s_2017.symmetric_difference(s_2018)

print("Symmetric difference between 2018 and 2017 teams. count: {0} \n \nteams: {1}".format(len(sd), sd))

Symmetric difference between 2018 and 2017 teams. count: 39 
 
teams: {'Marist', 'Central Mich.', 'Washington', 'Hofstra', 'Belmont', 'Buffalo', 'Loyola Marymount', 'Austin Peay', 'Virginia', 'Villanova', 'Samford', 'Wagner', 'Butler', 'Toledo', 'Lipscomb', 'SIUE', 'Mississippi St.', 'Utah', 'Davidson', 'New Mexico St.', 'Incarnate Word', 'UMass Lowell', 'NYIT', 'FIU', 'Sacramento St.', 'UC Riverside', 'Abilene Christian', 'Wichita St.', 'Ohio St.', 'Northwestern', 'UMES', 'George Washington', 'Western Mich.', 'VMI', 'Southern Ill.', 'Indiana', 'Grand Canyon', 'Miami (FL)', 'Maryland'}
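
The symmetric difference can be split into its two one-sided differences to see which teams belong to only one season. The sketch below uses hypothetical toy sets (the team names and variable names are invented for the example) and Python's set operators `-` and `^`.

```python
# Hypothetical toy sets standing in for the two seasons.
teams_2017 = {"Team A", "Team B", "Team C"}
teams_2018 = {"Team B", "Team C", "Team D", "Team E"}

only_2017 = teams_2017 - teams_2018        # in 2017 but not in 2018
only_2018 = teams_2018 - teams_2017        # in 2018 but not in 2017
either_not_both = teams_2017 ^ teams_2018  # same as symmetric_difference()

# The symmetric difference is the union of the two one-sided differences.
assert either_not_both == only_2017 | only_2018
assert either_not_both == teams_2017.symmetric_difference(teams_2018)

print("Only 2017:", sorted(only_2017))
print("Only 2018:", sorted(only_2018))
print("Either but not both:", sorted(either_not_both))
```

The same identities apply to the counts reported above: with 278 names in 2017, 275 in 2018, and 39 in the symmetric difference, the two one-sided differences must contain 21 and 18 names, since 21 + 18 = 39 and 21 - 18 = 278 - 275.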


In Exercise 5.2 we used the default strategy of the `DataFrame` `merge()` method, an inner join, which keeps only the keys found in both tables. This means that none of the teams listed in the output of the cell above appear in the `DataFrame` produced in Exercise 5.2.

I) This could be a desirable outcome for many applications, but suppose we are interested in keeping all of the teams that competed in *either* the 2017 season or the 2018 season. Which merging strategy should be used?

A.
Inner Join

B.
Outer Join

C.
Left Join

D.
Right Join

E.
None Of The Above

**Correct Answer**
B.
Outer Join

**Explanation**
An outer join keeps the union of the keys found in the left and right tables, i.e. every key that appears in the left table, the right table, or both. Where a key has no match in the other table, the columns from that table are filled with missing values (`NaN`).
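
To make the four strategies concrete, the sketch below merges two hypothetical toy tables (invented names and numbers) with each value of `how` and counts the resulting rows. It also passes `indicator=True`, which adds a `_merge` column reporting whether each row came from the left table only, the right table only, or both.

```python
import pandas as pd

# Hypothetical toy tables: one shared team, one team unique to each side.
left = pd.DataFrame({"Name": ["Team A", "Team B"], "BA": [0.310, 0.300]})
right = pd.DataFrame({"Name": ["Team B", "Team C"], "BA": [0.320, 0.295]})

# Outer join: every key from either table is kept.
outer = left.merge(
    right,
    on="Name",
    how="outer",
    suffixes=("_2017", "_2018"),
    indicator=True,  # adds a `_merge` column describing each row's origin
)
print(outer)

# Number of rows produced by each merging strategy.
row_counts = {
    how: len(left.merge(right, on="Name", how=how))
    for how in ("inner", "outer", "left", "right")
}
print(row_counts)
```

Only the outer join returns a row for all three teams; the team missing from a season has `NaN` in that season's columns.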

II) Use the code cell below to merge the `DataFrames` `batting_2018` and `batting_2017` using the common `Name` column as the key. Use the correct merging strategy so that all of the teams that competed in *either* the 2017 season or 2018 season are included in the resulting `DataFrame`. Apply the suffixes '_2017' and '_2018' to the overlapping column names. Save the resulting `DataFrame` as `batting_2017_2018_df`.

*Hint:* You can set the `suffixes` parameter of the `DataFrame` `merge()` method like so: `suffixes=('_2017', '_2018')` to obtain the desired column labels for the new `DataFrame`.

In [28]:
batting_2017_2018_df = batting_2017.merge(batting_2018, on='Name', how='outer', suffixes=('_2017', '_2018'))
batting_2017_2018_df.head()

Unnamed: 0,Rank_2017,Name,G_2017,W.L_2017,AB_2017,H_2017,BA_2017,Rank_2018,G_2018,W.L_2018,AB_2018,H_2018,BA_2018
0,1,Tennessee Tech,65,53-12,2370,788,0.332,1,65,53-12,2370,788,0.332
1,2,UNC Greensboro,54,39-15,1917,624,0.326,2,54,39-15,1917,624,0.326
2,3,Oregon St.,68,55-12-1,2345,753,0.321,3,68,55-12-1,2345,753,0.321
3,4,Morehead St.,63,37-26,2325,734,0.316,4,63,37-26,2325,734,0.316
4,5,New Mexico St.,62,40-22,2113,656,0.31,5,62,40-22,2113,656,0.31


# Exercise 5.4: Concatenating `DataFrames`

`scoring_by_name_2018` is a subset of `scoring_2018` indexed by `scoring_2018.Name` and created using the following code.

```python
scoring_by_name_2018 = scoring_2018[['G', 'W.L', 'R', 'PG']]
scoring_by_name_2018.index = scoring_2018.Name
scoring_by_name_2018 = scoring_by_name_2018.sort_index()
```

`batting_by_name_2018` is a subset of `batting_2018` indexed by `batting_2018.Name` and created using the following code.

```python
batting_by_name_2018 = batting_2018[['AB', 'H', 'BA']]
batting_by_name_2018.index = batting_2018.Name
batting_by_name_2018 = batting_by_name_2018.sort_index()
```

Which of the following code cells will perform a column based concatenation on the two `DataFrames`, `scoring_by_name_2018` and `batting_by_name_2018`? 

A.

```python
scoring_by_name_2018.concat(batting_by_name_2018, axis='columns')
```

B.

```python
pd.concat([scoring_by_name_2018, batting_by_name_2018], axis='columns')
```

C.

```python
pd.concat([scoring_by_name_2018, batting_by_name_2018])
```

D.

None of the above

**Correct Answer**

B.

```python
pd.concat([scoring_by_name_2018, batting_by_name_2018], axis='columns')
```

**Explanation**

A. 

`concat()` is a top-level `pandas` function (`pd.concat()`), not a `DataFrame` method, so this call raises an `AttributeError`.

B.

This solution properly uses `pd.concat()`: the `DataFrames` are passed as a list, and `axis='columns'` requests a column-based concatenation.

C.

By default, `concat()` uses `axis=0` and performs a row-based concatenation, which is not what this problem asks for.

D.

The correct answer is B.
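
The difference between options A, B, and C can be checked on a small example. The sketch below uses two hypothetical toy `DataFrames` (invented values, both indexed by the same two team names) and compares the shapes of the column-based and row-based results.

```python
import pandas as pd

# Hypothetical toy tables sharing the same index of team names.
teams = pd.Index(["Team A", "Team B"], name="Name")
scoring = pd.DataFrame({"R": [50, 40]}, index=teams)
batting = pd.DataFrame({"H": [90, 80]}, index=teams)

# Option B: column-based concatenation places the tables side by side.
wide = pd.concat([scoring, batting], axis="columns")

# Option C: the default (axis=0) stacks the tables on top of each other,
# leaving NaN wherever a column is missing from one of them.
tall = pd.concat([scoring, batting])

print(wide.shape)  # (2, 2)
print(tall.shape)  # (4, 2)

# Option A: DataFrame objects have no `concat` attribute.
has_method = hasattr(scoring, "concat")
print(has_method)  # False
```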