# Manipulating DataFrames

Regardless of the original data source, once you have data loaded into a DataFrame, you gain the ability to manipulate your data. For instance, you can

  - Reshape a DataFrame
    - sort
    - drop columns
    - `melt()` and `pivot()`
  - Select rows using
    - Logical criteria
    - `head()` and `tail()`
    - `iloc[]`
  - Select columns
  - Handle missing data with `dropna()` and `fillna()`
  - Make new columns
  - Combine datasets with `merge()`
  - Group data
    
To explore this, we'll read in a DataFrame consisting of the salaries and personal statistics of major league baseball players.

In [1]:
import pandas as pd

players_df = pd.read_csv("https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/players.csv")

First, we should examine our DataFrame using some of the techniques we looked at in the previous checkpoint.

In [2]:
players_df.shape

(26428, 14)

In [3]:
players_df.describe()

Unnamed: 0,birthyear,deathyear,weight,height,yearid,salary
count,26428.0,492.0,26428.0,26428.0,26428.0,26428.0
mean,1971.389133,2008.833333,199.022136,73.509006,2000.878727,2085634.0
std,9.679736,7.188803,22.631696,2.284665,8.909314,3455348.0
min,1925.0,1989.0,140.0,66.0,1985.0,0.0
25%,1964.0,2006.0,185.0,72.0,1994.0,294702.0
50%,1971.0,2011.0,195.0,74.0,2001.0,550000.0
75%,1979.0,2015.0,215.0,75.0,2009.0,2350000.0
max,1995.0,2018.0,315.0,83.0,2016.0,33000000.0


In [4]:
players_df.columns

Index(['playerid', 'birthyear', 'birthcountry', 'deathyear', 'namefirst',
       'namelast', 'weight', 'height', 'bats', 'throws', 'yearid', 'teamid',
       'lgid', 'salary'],
      dtype='object')

In [5]:
players_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26428 entries, 0 to 26427
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   playerid      26428 non-null  object 
 1   birthyear     26428 non-null  int64  
 2   birthcountry  26428 non-null  object 
 3   deathyear     492 non-null    float64
 4   namefirst     26428 non-null  object 
 5   namelast      26428 non-null  object 
 6   weight        26428 non-null  int64  
 7   height        26428 non-null  int64  
 8   bats          26428 non-null  object 
 9   throws        26428 non-null  object 
 10  yearid        26428 non-null  int64  
 11  teamid        26428 non-null  object 
 12  lgid          26428 non-null  object 
 13  salary        26428 non-null  int64  
dtypes: float64(1), int64(5), object(8)
memory usage: 2.8+ MB


In [6]:
players_df

Unnamed: 0,playerid,birthyear,birthcountry,deathyear,namefirst,namelast,weight,height,bats,throws,yearid,teamid,lgid,salary
0,barkele01,1955,USA,,Len,Barker,225,77,R,R,1985,ATL,NL,870000
1,bedrost01,1957,USA,,Steve,Bedrosian,200,75,R,R,1985,ATL,NL,550000
2,benedbr01,1955,USA,,Bruce,Benedict,175,73,R,R,1985,ATL,NL,545000
3,campri01,1953,USA,2013.0,Rick,Camp,195,73,R,R,1985,ATL,NL,633333
4,ceronri01,1954,USA,,Rick,Cerone,192,71,R,R,1985,ATL,NL,625000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26423,strasst01,1988,USA,,Stephen,Strasburg,235,76,R,R,2016,WAS,NL,10400000
26424,taylomi02,1991,USA,,Michael,Taylor,210,75,R,R,2016,WAS,NL,524000
26425,treinbl01,1988,USA,,Blake,Treinen,225,77,R,R,2016,WAS,NL,524900
26426,werthja01,1979,USA,,Jayson,Werth,235,77,R,R,2016,WAS,NL,21733615


### Missing values

By a `null` value, `pandas` means a value that is missing. Null values are represented in `pandas` as `NaN`. A null value does *not* necessarily mean that the number is zero. A value could be missing because of a recording error, because the field does not apply, because of sampling bias, and so on. In later modules we will take a closer look at how to diagnose missing values.

The `.isnull()` method returns a Boolean result indicating whether each value in a DataFrame is missing. `True` equates to a missing value. 

In [7]:
players_df.isnull()

Unnamed: 0,playerid,birthyear,birthcountry,deathyear,namefirst,namelast,weight,height,bats,throws,yearid,teamid,lgid,salary
0,False,False,False,True,False,False,False,False,False,False,False,False,False,False
1,False,False,False,True,False,False,False,False,False,False,False,False,False,False
2,False,False,False,True,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,True,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26423,False,False,False,True,False,False,False,False,False,False,False,False,False,False
26424,False,False,False,True,False,False,False,False,False,False,False,False,False,False
26425,False,False,False,True,False,False,False,False,False,False,False,False,False,False
26426,False,False,False,True,False,False,False,False,False,False,False,False,False,False


Immediately we can see that there are quite a few missing values in the `deathyear` field. It makes sense that there are missing values here -- if a player is still living, then their death year is indeed unknown. Making this connection shows the importance of understanding how data is collected -- the data itself may not provide the answers.
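
Scanning a full table of `True`/`False` values becomes impractical quickly. As a minimal sketch, summing the Boolean result counts the missing values in each column. The `toy` DataFrame here is a made-up stand-in for `players_df`, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for players_df; the values are made up for illustration
toy = pd.DataFrame({
    "playerid": ["a01", "b01", "c01"],
    "deathyear": [np.nan, 2013.0, np.nan],
    "salary": [870000, 633333, 625000],
})

# True counts as 1 when summed, so this gives the number of nulls per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```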

Should we want to exclude all records with a missing value, we can call the `dropna()` method. By default, this will drop any row which includes a missing value, in any field. There are many other ways to drop missing values in `pandas` -- to learn more, check the [official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html).

In [8]:
players_df_filtered = players_df.dropna()
players_df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 492 entries, 3 to 25988
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   playerid      492 non-null    object 
 1   birthyear     492 non-null    int64  
 2   birthcountry  492 non-null    object 
 3   deathyear     492 non-null    float64
 4   namefirst     492 non-null    object 
 5   namelast      492 non-null    object 
 6   weight        492 non-null    int64  
 7   height        492 non-null    int64  
 8   bats          492 non-null    object 
 9   throws        492 non-null    object 
 10  yearid        492 non-null    int64  
 11  teamid        492 non-null    object 
 12  lgid          492 non-null    object 
 13  salary        492 non-null    int64  
dtypes: float64(1), int64(5), object(8)
memory usage: 57.7+ KB


The filtered DataFrame, `players_df_filtered`, now contains only 492 rows: every row where the player's death year is missing has been dropped. The original `players_df` is unchanged.

Let's say that instead of dropping these rows, we want to fill in the missing values with something other than `NaN`. We can do so with the `fillna()` method. 

This can be useful in making a DataFrame more legible, or in assigning a special value or character to nulls. 


In [9]:
players_df.fillna('To be determined')

Unnamed: 0,playerid,birthyear,birthcountry,deathyear,namefirst,namelast,weight,height,bats,throws,yearid,teamid,lgid,salary
0,barkele01,1955,USA,To be determined,Len,Barker,225,77,R,R,1985,ATL,NL,870000
1,bedrost01,1957,USA,To be determined,Steve,Bedrosian,200,75,R,R,1985,ATL,NL,550000
2,benedbr01,1955,USA,To be determined,Bruce,Benedict,175,73,R,R,1985,ATL,NL,545000
3,campri01,1953,USA,2013,Rick,Camp,195,73,R,R,1985,ATL,NL,633333
4,ceronri01,1954,USA,To be determined,Rick,Cerone,192,71,R,R,1985,ATL,NL,625000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26423,strasst01,1988,USA,To be determined,Stephen,Strasburg,235,76,R,R,2016,WAS,NL,10400000
26424,taylomi02,1991,USA,To be determined,Michael,Taylor,210,75,R,R,2016,WAS,NL,524000
26425,treinbl01,1988,USA,To be determined,Blake,Treinen,225,77,R,R,2016,WAS,NL,524900
26426,werthja01,1979,USA,To be determined,Jayson,Werth,235,77,R,R,2016,WAS,NL,21733615
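
Both methods accept arguments that narrow their effect. The sketch below works on a made-up `toy` DataFrame rather than `players_df`. It uses the `subset` parameter of `dropna()` to check only one column, and passes a dictionary to `fillna()` so that only one column is filled.

```python
import numpy as np
import pandas as pd

# Toy data; the values are made up for illustration
toy = pd.DataFrame({
    "playerid": ["a01", "b01", "c01"],
    "deathyear": [np.nan, 2013.0, np.nan],
    "salary": [870000, 633333, np.nan],
})

# Drop a row only when `salary` is missing; a missing `deathyear` is tolerated
has_salary = toy.dropna(subset=["salary"])

# Fill only the `salary` column, here with the column median
filled = toy.fillna({"salary": toy["salary"].median()})

print(has_salary)
print(filled)
```

Neither call changes `toy`; each returns a new DataFrame.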


## Add new columns

Sometimes, you'll need to add new columns by deriving values from calculations involving other columns. For example, since we have each player's weight (in pounds) and height (in inches), we can compute a new column for body mass index (BMI) using the formula:

$$ \text{bmi} = \frac{703 \times \text{weight}}{\text{height}^2} $$

To create a column of values calculated from the values in other columns, use the `assign()` method of the DataFrame.

In [10]:
players_df = players_df.assign(bmi = (703 * players_df.weight) / (players_df.height**2))
players_df.head()

Unnamed: 0,playerid,birthyear,birthcountry,deathyear,namefirst,namelast,weight,height,bats,throws,yearid,teamid,lgid,salary,bmi
0,barkele01,1955,USA,,Len,Barker,225,77,R,R,1985,ATL,NL,870000,26.678192
1,bedrost01,1957,USA,,Steve,Bedrosian,200,75,R,R,1985,ATL,NL,550000,24.995556
2,benedbr01,1955,USA,,Bruce,Benedict,175,73,R,R,1985,ATL,NL,545000,23.085945
3,campri01,1953,USA,2013.0,Rick,Camp,195,73,R,R,1985,ATL,NL,633333,25.724339
4,ceronri01,1954,USA,,Rick,Cerone,192,71,R,R,1985,ATL,NL,625000,26.77564


Or to simply add a new column with some default:

In [11]:
players_df['X'] = 0
players_df.head()

Unnamed: 0,playerid,birthyear,birthcountry,deathyear,namefirst,namelast,weight,height,bats,throws,yearid,teamid,lgid,salary,bmi,X
0,barkele01,1955,USA,,Len,Barker,225,77,R,R,1985,ATL,NL,870000,26.678192,0
1,bedrost01,1957,USA,,Steve,Bedrosian,200,75,R,R,1985,ATL,NL,550000,24.995556,0
2,benedbr01,1955,USA,,Bruce,Benedict,175,73,R,R,1985,ATL,NL,545000,23.085945,0
3,campri01,1953,USA,2013.0,Rick,Camp,195,73,R,R,1985,ATL,NL,633333,25.724339,0
4,ceronri01,1954,USA,,Rick,Cerone,192,71,R,R,1985,ATL,NL,625000,26.77564,0


Both of those options operate on whole columns at once: the expression is evaluated for every row without writing an explicit loop.
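
One difference between the two approaches is worth seeing on a small example: `assign()` returns a new DataFrame, while bracket assignment changes the existing one. The `toy` DataFrame and the `bmi_rounded` column name are made up for this sketch.

```python
import pandas as pd

# Toy data using the first two weight/height pairs shown above
toy = pd.DataFrame({"weight": [225, 200], "height": [77, 75]})

# assign() returns a new DataFrame; `toy` itself is unchanged
with_bmi = toy.assign(bmi=703 * toy["weight"] / toy["height"] ** 2)

# Bracket assignment adds the column to `toy` in place
toy["bmi_rounded"] = with_bmi["bmi"].round(1)

print(with_bmi)
print(toy)
```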

## Deleting Columns

To remove a column, use the `drop()` method. This method can drop either rows or columns, so the axis has to be specified (`axis=1` for columns). Let's remove the new column we just created. By default, `drop()` does not modify the DataFrame; it returns a new DataFrame with the change and leaves the existing one untouched. This can be useful in many cases, but you need to be aware of the behavior.

As an alternative, you can set the `inplace` parameter to `True` to make the operation work on the existing DataFrame. In the following code snippet, we first drop a column and assign the new modified DataFrame to a new variable named `players_df_changed`. The original DataFrame named `players_df` still exists. Next, the original DataFrame is modified by dropping the same column while passing the `inplace` parameter set to `True`.

In [21]:
# delete the X column and store the new DataFrame in a new variable
players_df_changed = players_df.drop(['X'], axis=1)

print('The columns of the new DataFrame')
print(players_df_changed.columns)

print('The columns of the original DataFrame')
print(players_df.columns)

# Alternatively, modify the DataFrame inplace
players_df.drop(['X'], axis=1, inplace=True)
print('The columns of the DataFrame after deleting inplace')
print(players_df.columns)


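Note that this cell only runs cleanly once. Running it a second time raises the error below, because the `X` column was already removed by the in-place `drop()`:

KeyError: "['X'] not found in axis"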

In [24]:
players_df.shape

(26428, 15)
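
If a cell that drops a column may be run more than once, one option is to reassign the result and pass `errors="ignore"`, which skips labels that are no longer present. This is a sketch on a made-up `toy` DataFrame, not the notebook's own approach.

```python
import pandas as pd

toy = pd.DataFrame({"playerid": ["a01", "b01"], "X": [0, 0]})

# drop() returns a new DataFrame; reassigning the name replaces inplace=True
toy = toy.drop(columns=["X"])

# A second plain drop would raise KeyError; errors="ignore" skips missing labels
toy = toy.drop(columns=["X"], errors="ignore")

print(list(toy.columns))
```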

## Selecting Rows

One thing you'll need to do constantly when exploring data in `pandas` is selecting a subset of rows based on some criterion. Suppose, for instance, that you need to see the data for just a single year, or that you need to select some rows by number.

### Selecting by Number

We have already seen that you can get the first `n` rows or the last `n` rows of a DataFrame using `head(n)` and `tail(n)` respectively. But what if you wanted to select rows 20 to 25? To select rows based on position use **iloc**.  



In [13]:
# select the first row
players_df.iloc[0]

playerid        barkele01
birthyear            1955
birthcountry          USA
deathyear             NaN
namefirst             Len
namelast           Barker
weight                225
height                 77
bats                    R
throws                  R
yearid               1985
teamid                ATL
lgid                   NL
salary             870000
bmi               26.6782
Name: 0, dtype: object

`iloc` may take inputs in a number of different ways:

 - an integer
 - a list of integers
 - a slice object
 - a boolean array
 
When a single row is selected the result is a Series, that is, a single column of values indexed with the column names. When a list of integers is provided, the result is instead a DataFrame.

In [14]:
# select the first row, this time with a list of integers
players_df.iloc[[0]]

Unnamed: 0,playerid,birthyear,birthcountry,deathyear,namefirst,namelast,weight,height,bats,throws,yearid,teamid,lgid,salary,bmi
0,barkele01,1955,USA,,Len,Barker,225,77,R,R,1985,ATL,NL,870000,26.678192


To select more than one row, just list the row numbers.

In [15]:
players_df.iloc[[0, 1, 5, 8, 10]]

Unnamed: 0,playerid,birthyear,birthcountry,deathyear,namefirst,namelast,weight,height,bats,throws,yearid,teamid,lgid,salary,bmi
0,barkele01,1955,USA,,Len,Barker,225,77,R,R,1985,ATL,NL,870000,26.678192
1,bedrost01,1957,USA,,Steve,Bedrosian,200,75,R,R,1985,ATL,NL,550000,24.995556
5,chambch01,1948,USA,,Chris,Chambliss,195,73,L,R,1985,ATL,NL,800000,25.724339
8,garbege01,1947,USA,,Gene,Garber,175,70,R,R,1985,ATL,NL,772000,25.107143
10,hornebo01,1957,USA,,Bob,Horner,195,73,R,R,1985,ATL,NL,1500000,25.724339


In [20]:
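# average height by birth year (grouping is covered in the Grouping section below)
# selecting 'height' before mean() avoids averaging the non-numeric columns
players_df.groupby(by='birthyear')['height'].mean()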

birthyear
1925    72.000000
1939    73.000000
1941    71.000000
1942    74.000000
1943    75.000000
1944    73.642857
1945    71.461538
1946    74.222222
1947    73.617021
1948    72.914894
1949    73.506173
1950    72.421687
1951    73.834951
1952    73.017699
1953    73.431250
1954    73.162055
1955    73.498069
1956    73.435530
1957    73.247002
1958    73.466960
1959    73.193916
1960    73.273390
1961    73.329044
1962    73.582228
1963    73.450000
1964    72.956055
1965    73.274664
1966    73.630836
1967    73.395062
1968    73.541000
1969    73.268470
1970    73.758701
1971    72.994748
1972    73.595213
1973    73.481308
1974    73.625954
1975    73.495595
1976    73.537367
1977    73.540479
1978    73.655080
1979    73.637959
1980    73.706065
1981    74.282675
1982    73.660085
1983    73.642529
1984    73.865714
1985    73.700361
1986    74.021429
1987    73.607884
1988    73.963190
1989    73.672474
1990    73.365741
1991    73.535211
1992    73.796610
1993    73.526316


To select a run of consecutive rows, use a *slice*. In Python, a *slice* takes the form

```
[start:end]
```

where `start` and `end` are integers. It selects the positions from `start` up to, but not including, `end`. For example, `[2:5]` selects the same positions as `[2, 3, 4]`.



In [None]:
# Select rows 5 to 10
players_df.iloc[5:11]
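
The input types discussed so far can be compared side by side. This sketch uses a made-up four-row `toy` DataFrame so that the result types are easy to check.

```python
import pandas as pd

toy = pd.DataFrame({
    "namelast": ["Barker", "Bedrosian", "Benedict", "Camp"],
    "weight": [225, 200, 175, 195],
})

one_row = toy.iloc[0]        # integer -> Series
one_row_df = toy.iloc[[0]]   # list of integers -> DataFrame with one row
middle = toy.iloc[1:3]       # slice -> DataFrame with positions 1 and 2

print(type(one_row).__name__, type(one_row_df).__name__)
print(middle)
```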

Alternatively, `iloc` may take a list of booleans that is the same length as the index and only return the rows that are `True`. For instance, if the list `[False, True, True, False]` is used to select from a DataFrame with 4 rows, then it will skip the first row, take the second and third rows, and skip the last row.

Our DataFrame has over 26 thousand rows, so typing a list of 26 thousand boolean values is just not practical. However, we can generate the booleans from some value in the DataFrame. For instance, suppose we want all players that weigh over 200 lbs. We can generate a Series of booleans like this:

In [None]:
# create a Series of booleans
over_200 = players_df['weight'] > 200
over_200.head(10)

The `over_200` Series now contains a boolean value for each row in the DataFrame. We can see how many `True`s and how many `False`s with the `value_counts()` method.

In [None]:
over_200.value_counts()

To select only the players corresponding to a `True`, we can pass the booleans to `iloc`. Because `iloc` does not accept a boolean Series directly, we pass the underlying array with `.values`.

In [None]:
players_df.iloc[over_200.values]

The result is a DataFrame with 10,599 rows. Here is another example with multiple conditions: select all players over 200 lbs who bat left-handed and throw right-handed.

In [None]:
over_200_L_R = (players_df['weight'] > 200) & (players_df['bats'] == 'L') & (players_df['throws'] == 'R')
players_df.iloc[over_200_L_R.values]
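
Boolean selection is not limited to `iloc`. As a sketch on made-up data, the same mask can be passed to `loc` or to plain brackets without `.values`, and all three give the same rows. When combining conditions, each comparison needs its own parentheses because `&` binds more tightly than `>` or `==`.

```python
import pandas as pd

toy = pd.DataFrame({
    "weight": [225, 200, 175, 210],
    "bats": ["R", "L", "L", "L"],
    "throws": ["R", "R", "R", "R"],
})

over_200 = toy["weight"] > 200

via_iloc = toy.iloc[over_200.values]   # iloc needs the raw boolean array
via_loc = toy.loc[over_200]            # loc accepts the boolean Series
via_brackets = toy[over_200]           # so do plain brackets

# Multiple conditions: wrap each comparison in parentheses
combined = toy[(toy["weight"] > 200) & (toy["bats"] == "L")]
print(combined)
```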

### Selecting on both axes

Sometimes you want to select only some columns of the DataFrame. `iloc` allows that with a second value representing the second axis. Like the rows, the columns are numbered from 0. So to select the first and last name, the weight, height, bats and throws columns for the first 10 rows of the DataFrame, we would need to specify a slice of `[4:10]` for the columns.



In [None]:
# rows 0 through 9 and columns 4 through 9
players_df.iloc[0:10, 4:10]

To make the rest of this discussion a little easier to illustrate, let's select a subset of our data. Let's select all players that played for the team CLE in 2015, and limit ourselves to the following columns: playerid, birthyear, first and last name, weight, height, bats, throws, and salary.

In [None]:
#create a Series of booleans for the row selection
cle_options = (players_df['teamid'] == 'CLE') & (players_df['yearid'] == 2015)
cle_2015 = players_df.iloc[cle_options.values, [0, 1, 4, 5, 6, 7, 8, 9, 13]]
cle_2015
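
Counting column positions by hand is error-prone. As an alternative sketch, `loc` selects rows with a boolean Series and columns by name. The `toy` DataFrame below is made up and has fewer columns than `players_df`.

```python
import pandas as pd

toy = pd.DataFrame({
    "playerid": ["a01", "b01", "c01"],
    "teamid": ["CLE", "ATL", "CLE"],
    "yearid": [2015, 2015, 2014],
    "salary": [500000, 600000, 700000],
})

mask = (toy["teamid"] == "CLE") & (toy["yearid"] == 2015)

# loc takes the boolean Series directly and column names instead of positions
by_name = toy.loc[mask, ["playerid", "salary"]]

# Equivalent iloc call: positions 0 and 3 are playerid and salary
by_position = toy.iloc[mask.values, [0, 3]]

print(by_name)
```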

## Sorting

To sort a DataFrame, we use `sort_values()`. It's possible to sort by the values in one or more columns and to sort in either ascending or descending order. The `sort_values()` method takes several parameters:

 - `by`: either the name of a single column or a list of names of columns
 - `axis`: either 0 to sort rows or 1 to sort columns, default 0.
 - `ascending`: `True`, or `False` for descending, defaults to `True`
 - `inplace`: `True` to sort inplace and modify the DataFrame, `False` to create a new DataFrame, defaults to False
 - `na_position`: where to put `NaN`s, either 'first' or 'last', defaults to 'last'

In [None]:
# sort the players by weight from lowest value to highest value (ascending)
# accept all other defaults
weight_asc = cle_2015.sort_values(by='weight')
weight_asc

Suppose you want to sort by weight, then by height. Look at the sorted DataFrame above. Notice that players with the same weight are not necessarily ordered by height. To fix this, we can provide a list of columns to sort by.

In [None]:
weight_height_asc = cle_2015.sort_values(by=['weight', 'height'])
weight_height_asc
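
The `ascending` parameter also accepts a list, one value per sort column. This sketch, on made-up data, sorts by weight from lowest to highest and breaks ties by height from tallest to shortest.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "weight": [200, 185, 200, 185],
    "height": [74, 72, 76, 70],
})

# weight ascending, then height descending within each weight
mixed = toy.sort_values(by=["weight", "height"], ascending=[True, False])
print(mixed)
```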

Suppose that we want to study the salaries of all players across all teams that played in the five-year period from 2011 to 2015. First, we can restrict the DataFrame to only those rows involving those years.

In [None]:
years = (players_df['yearid'] >= 2011) & (players_df['yearid'] <= 2015)
five_years = players_df.iloc[years.values, [4, 5, 10, 13]]
five_years.head()

Next, let's create a single column with each player's full name. This way we can use that column in the next step.

In [None]:
five_years = five_years.assign(fullname = five_years.namefirst + ' ' + five_years.namelast)
five_years.head()

At this point, the DataFrame contains the data we want, but it's not in a format that is easy to make sense of. There is a row for each year that a player played on a team. So each player appears multiple times in the DataFrame. For instance, we can look for 'Henry Blanco':

In [None]:
five_years.iloc[(five_years['fullname'] == 'Henry Blanco').values]

Henry Blanco played in 2011, 2012 and 2013. We can see his salary over those years too. But we want to be able to see this data for all players. It may be possible to sort by the player name, but comparing salaries for a single year is then difficult. What would help is a pivot table, very similar to the pivot tables that you create in a spreadsheet. A pivot table reshapes the DataFrame from a long format to a wide format. We could create a table with the players' names in a single column and each year as a column. Each year column would then contain the salary for that player for that year.

## Grouping 

`groupby()` allows us to group records by one or more categorical variables. Typically we will want to chain some data manipulation method after `groupby()` to summarize the data.

For example, let's say we wanted to calculate the average for our statistics grouped by each year. We will first group the DataFrame by `yearid`, then calculate the averages with `mean()`:

In [None]:
# numeric_only=True skips text columns such as playerid, which cannot be averaged
players_df.groupby(['yearid']).mean(numeric_only=True)

For another example, we could calculate descriptive statistics for all records grouped by batting and throwing arms. Let's index the results just to retrieve statistics for `salary`.

In [None]:
players_df.groupby(['bats','throws'])['salary'].describe()
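
When only a few summaries are needed, `agg()` accepts a list of function names to apply to each group. The following is a small sketch on made-up salary figures.

```python
import pandas as pd

toy = pd.DataFrame({
    "yearid": [2014, 2014, 2015, 2015],
    "salary": [500000, 1500000, 600000, 2400000],
})

# One row per year, one column per requested statistic
summary = toy.groupby("yearid")["salary"].agg(["mean", "max"])
print(summary)
```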

## Reshaping the DataFrame

### `pivot()`

The data that we have consists of a list of players and their salaries for the years that they played. To create a pivot table, we need to specify the index, that is, which column will be used to identify each row of the DataFrame. Then, we specify which column values are going to become the column names of the table. Finally, we specify which column will provide the values that go in the table cells.

One important consideration: each combination of index value and column value must be unique in the DataFrame. If a player name and year are duplicated (maybe the player played for two different teams in the same year), we will get an error when trying to create the pivot table. There are many different ways we can deal with duplicates in the data, and careful thought should be given to this problem. Here, we will simply drop all duplicates as a quick and simple solution.

In [None]:
# first get rid of duplicates that may cause problems
five_years = five_years.drop_duplicates(['fullname', 'yearid'])

# then create the pivot table
five_years_pivot = five_years.pivot(index='fullname', columns='yearid', values='salary')

print(five_years_pivot)

You may notice that the printed DataFrame shows `yearid` above `fullname`. That is because `fullname` is now the name of the row index and `yearid` is the name of the column index, rather than either being an ordinary column. For more on indexing, including hierarchical (multi-level) indexes, check the [`pandas` docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html).

To pivot the data so that it appears in a more familiar tabular format, chain the `reset_index()` method after you `pivot()` the DataFrame:

In [None]:
# create the pivot table, then move fullname from the index back into a regular column
five_years_pivot = five_years.pivot(index='fullname', columns='yearid', values='salary').reset_index()

print(five_years_pivot)
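
Dropping duplicates discards data. As one alternative sketch (not the approach used in this notebook), `pivot_table()` combines duplicated index/column pairs with an aggregation function instead of raising an error. The names and salary values below are made up.

```python
import pandas as pd

long_df = pd.DataFrame({
    "fullname": ["A Player", "A Player", "A Player", "B Player"],
    "yearid": [2011, 2011, 2012, 2011],
    "salary": [100, 300, 500, 700],
})

# pivot() would raise a ValueError here: ("A Player", 2011) appears twice.
# pivot_table() sums the duplicated entries instead.
wide_df = long_df.pivot_table(
    index="fullname", columns="yearid", values="salary", aggfunc="sum"
)
print(wide_df)
```

Whether summing is appropriate depends on what the duplicate rows mean, so the choice of `aggfunc` deserves the same careful thought as dropping rows.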

### `melt()`

With the `melt()` function, we can reshape a DataFrame from a wide format back to a long format.

This function takes several arguments, in this form:

`pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)`


Here's what each argument does, according to this [`pandas.melt()` reference](https://www.geeksforgeeks.org/python-pandas-melt/) (we'll focus on the first five):


| Argument     | What it does                                                                                   |
| ------------ | ---------------------------------------------------------------------------------------------- |
| `frame`      | The DataFrame to unpivot.                                                                      |
| `id_vars`    | Column(s) to use as identifier columns.                                                        |
| `value_vars` | Column(s) to unpivot. If not specified, uses all columns that are not set as `id_vars`.        |
| `var_name`   | Name to use for the `variable` column. If `None`, uses `frame.columns.name` or `variable`.     |
| `value_name` | Name to use for the `value` column.                                                            |
| `col_level`  | If columns are a MultiIndex then use this level to melt.                                       |

Using `melt()`, how can we "un-pivot" the above `five_years_pivot`? Rather than having five columns for `2011`-`2015` indicating salary information, we just want one column called `year` and another called `salary`. 

That means that `fullname` is our `id_vars`, and the rest are our `value_vars`. We'll set `var_name` to `year` and `value_name` to `salary`, as previously specified. 



In [None]:
pd.melt(frame=five_years_pivot, id_vars=['fullname'],var_name='year',value_name='salary')
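
A pivoted table holds `NaN` wherever a player has no salary for a given year, and `melt()` turns each of those cells into its own row. The sketch below, on a made-up wide table, chains `dropna()` to remove them.

```python
import pandas as pd

# Made-up wide table: "B Player" has no salary recorded for 2012
wide_df = pd.DataFrame({
    "fullname": ["A Player", "B Player"],
    2011: [100, 700],
    2012: [500, None],
})

long_df = pd.melt(
    wide_df, id_vars=["fullname"], var_name="year", value_name="salary"
).dropna(subset=["salary"])

print(long_df)
```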

## Merging DataFrames

### `merge()`

Joining two or more tables is one of the most common data preparation tasks, and there are many ways to do it. You will be introduced to the world of joins later in the course with SQL, and this will give you a solid grasp of how joins are conducted in `pandas`, because `pandas` joins are modeled off of those in SQL. 

For now, we can rely on the default behavior of `pandas` to give us what we want when joining two tables.

The above DataFrame consisting of player information and salaries is actually the result of joining two individual DataFrames. We can reproduce this DataFrame using `merge()`. For now, call the `merge()` function and pass in the two DataFrames, assigning the results to a new DataFrame.

By default, `merge()` joins on every column name that the two DataFrames have in common. In this case, the `playerID` field is shared between the two tables, thus `pandas` will join based on that connection.





In [None]:
salaries = pd.read_csv("https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/salaries.csv")
people = pd.read_csv("https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/people.csv")

print(salaries.columns)
print(people.columns)

In [None]:
merged = pd.merge(salaries, people)

print(merged.columns)
print(merged.shape)

Indeed this is the same information as contained in the original DataFrame, `players_df`.

In [None]:
players_df = pd.read_csv("https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/players.csv")

print(players_df.columns)
print(players_df.shape)

Should you need to perform another type of join, such as one when columns are not named the same, you can check the [`pandas`](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html) documentation. However, as mentioned, the upcoming lessons on SQL will provide the logic behind how joins are conducted.  
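
As a brief preview, the join column and the join type can also be stated explicitly with the `on` and `how` parameters. The two small DataFrames below are made up and are not the `salaries` and `people` files loaded above.

```python
import pandas as pd

toy_salaries = pd.DataFrame({
    "playerid": ["a01", "b01", "c01"],
    "salary": [100, 200, 300],
})
toy_people = pd.DataFrame({
    "playerid": ["a01", "b01"],
    "namelast": ["Barker", "Camp"],
})

# Default inner join: keeps only playerids present in both tables
inner = pd.merge(toy_salaries, toy_people, on="playerid")

# Left join: keeps every salary row; unmatched names become NaN
left = pd.merge(toy_salaries, toy_people, on="playerid", how="left")

print(inner)
print(left)
```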

## Case Study 1

Your manager would like to study the distribution of right-handed batters to right-handed throwers. That is, how many right-handed batters are also right-handed throwers? They're also interested in other combinations. You look at the original data and realize that you will have to perform some sort of count over all the rows where the value for `bats` is `R` (right-handed) and the value for `throws` is `R`, and similarly for the other combinations.

The data contains duplicate rows for each player, as a row actually represents a single year per player. We do not need all of that information so we could first drop the duplicate players.

In [None]:
players_no_duplicates = players_df.drop_duplicates(['playerid'])
players_no_duplicates.shape

That leaves 5,149 players. Your manager wants a table of data that shows the counts of each combination of bats and throws. This is an ideal scenario for a `crosstab`. To create a crosstab, simply provide the two columns of data that you wish to cross tabulate and `pandas` does the rest for you.

In [None]:
pd.crosstab(players_no_duplicates['bats'], players_no_duplicates['throws'])
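
Two optional parameters of `crosstab()` may help when presenting the result: `margins=True` adds row and column totals, and `normalize="all"` converts counts into shares of the whole table. This sketch uses five made-up players.

```python
import pandas as pd

toy = pd.DataFrame({
    "bats": ["R", "R", "L", "B", "L"],
    "throws": ["R", "R", "R", "R", "L"],
})

# Counts with an extra "All" row and column holding the totals
counts = pd.crosstab(toy["bats"], toy["throws"], margins=True)

# Each cell as a fraction of all players
shares = pd.crosstab(toy["bats"], toy["throws"], normalize="all")

print(counts)
print(shares)
```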

## Case Study 2

Your company got some data about the fuel use of a number of vehicles in a relational database named **fueleconomy** on the same server as the previous database.  You would like to answer some questions about this data such as:

1. How many vehicles use each type of fuel by year?
2. What are the top-performing 6-cylinder vehicles in 2003 as measured by highway mileage?


First, we load the data into a DataFrame. In this notebook, the data is read from a CSV file.


In [None]:
vehicles_df = pd.read_csv("https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/vehicles.csv")

Next, let's look at some features of the data that we got, for instance, how many rows, what data types and so on.

In [None]:
vehicles_df.info()

Next, we can check how many years of data are contained in the DataFrame. That is, how many unique years are there?

In [None]:
vehicles_df['year'].value_counts()

Looks like we have data from 1985 to 2015. How many different types of fuel are there?

In [None]:
vehicles_df['fuel'].value_counts()

Then let us do a crosstab between these two columns and see what we learn.

In [None]:
pd.crosstab(vehicles_df['year'], vehicles_df['fuel'])

That gives us a useful look at the way different fuels are used over the years. Later, we will learn how to make charts out of this type of data. 

The second question will require a few steps. First we can select only the 6 cylinder vehicles from 2003.

In [None]:
bools = (vehicles_df['cyl'] == 6) & (vehicles_df['year'] == 2003)
six_cyl_df = vehicles_df.iloc[bools.values]
six_cyl_df.head()

Next, we can either sort the entire DataFrame by highway mileage in descending order or use the built-in `nlargest()` method to select the n rows with the largest values in the highway column. We can look at both solutions just to compare approaches.

In [None]:
# sort the DataFrame by hwy in descending order
sorted_six_cyl_df = six_cyl_df.sort_values(by='hwy', ascending=False)

#select the first 10 rows
sorted_six_cyl_df.head(10)

So that did the trick. We have the top ten vehicles by highway mileage in 2003. Notice that there are several vehicles with the same mileage values. Let's see how `nlargest()` can do the same job.

`nlargest` is really just a shortcut. We need to provide:

 - `n`: the number of rows to return
 - `columns`: the columns to sort by
 - `keep`: one of 'first', 'last' or 'all' - this affects how we deal with duplicate values in the sort column. Defaults to 'first'.

In [None]:
# select the 10 rows with largest values in the hwy column
six_cyl_df.nlargest(10, columns='hwy')

The rows returned may not match the sorted result exactly because of the way `nlargest()` deals with duplicates. By default, it keeps the first rows among a set of tied values. Experiment with this feature by trying 'last' and 'all' as values for the `keep` parameter.
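
The effect of `keep` is easiest to see on a tiny example with a tie at the cutoff. The `toy` DataFrame below is made up; models B and C share the second-highest `hwy` value.

```python
import pandas as pd

toy = pd.DataFrame({
    "model": ["A", "B", "C", "D"],
    "hwy": [30, 28, 28, 25],
})

first = toy.nlargest(2, columns="hwy")                  # default keep='first'
last = toy.nlargest(2, columns="hwy", keep="last")      # prefers the later tied row
all_ties = toy.nlargest(2, columns="hwy", keep="all")   # may return more than n rows

print(first)
print(last)
print(all_ties)
```

With `keep="all"`, every row tied at the cutoff is returned, so the result can be longer than `n`.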