# Eel imports

Now let's take a look at a cut of data on eel product imports. The data come from [a foreign trade database maintained by NOAA](https://www.st.nmfs.noaa.gov/commercial-fisheries/foreign-trade/).

The CSV file lives here: `../data/eels.csv`.

We'll start by importing pandas and creating a data frame.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../data/eels.csv')

In [3]:
df.head()

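| | year | month | country | product | kilos | dollars |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 2010 | 1 | CHINA | EELS FROZEN | 49087 | 393583 |
| 1 | 2010 | 1 | JAPAN | EELS FRESH | 263 | 7651 |
| 2 | 2010 | 1 | TAIWAN | EELS FROZEN | 9979 | 116359 |
| 3 | 2010 | 1 | VIETNAM | EELS FRESH | 1938 | 10851 |
| 4 | 2010 | 1 | VIETNAM | EELS FROZEN | 21851 | 69955 |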


### Check out the values

Now let's poke through the values in each column to see what we're working with using a combination of `unique()`, `min()` and `max()`. Questions we're trying to answer here: What years and months are represented? What countries are in the data? Do the numeric data make sense? Are there any obvious errors or typos to handle? Are there any holes in our data (use `info()`)?

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 805 entries, 0 to 804
Data columns (total 6 columns):
year       805 non-null int64
month      805 non-null int64
country    805 non-null object
product    805 non-null object
kilos      805 non-null int64
dollars    805 non-null int64
dtypes: int64(4), object(2)
memory usage: 37.8+ KB


In [6]:
df.year.unique()

array([2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017])

In [7]:
df.month.unique()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

In [8]:
df.country.unique()

array(['CHINA', 'JAPAN', 'TAIWAN', 'VIETNAM', 'PORTUGAL', 'THAILAND',
       'SOUTH KOREA', 'CANADA', 'SENEGAL', 'NEW ZEALAND', 'POLAND',
       'SPAIN', 'BANGLADESH', 'NORWAY', 'MEXICO', 'PHILIPPINES',
       'PAKISTAN', 'PANAMA', 'CHILE', 'BURMA', 'UKRAINE',
       'CHINA - HONG KONG', 'COSTA RICA', 'INDIA'], dtype=object)

In [9]:
# bracket notation is required here because "product" is also the name of a DataFrame method
df['product'].unique()

array(['EELS FROZEN', 'EELS FRESH',
       'EELS STICKS TYPE PRODUCTS NOT COOKED NOT IN OIL',
       'EELS IN ATC NOT IN OIL', 'EELS IN OIL >7KG',
       'EELS IN OIL NOT >7KG',
       'EELS STICKS TYPE PRODUCTS COOKED OR IN OIL'], dtype=object)

**Question:** What does "ATC" stand for? _Always ask, never assume._

![atc](../img/eel-q.png "atc")

In [9]:
print(df.kilos.max())
print(df.kilos.min())

427935
13


In [10]:
print(df.dollars.max())
print(df.dollars.min())

6850258
2002
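
As a compact alternative to printing each minimum and maximum separately, the pandas `agg()` method can return several summary statistics for several columns in one call. This is only a sketch: the `sample` data frame below is a hypothetical stand-in built from the first three rows shown earlier, not the full eel data.

```python
import pandas as pd

# Hypothetical stand-in: the first three rows from df.head() above
sample = pd.DataFrame({
    'kilos': [49087, 263, 9979],
    'dollars': [393583, 7651, 116359],
})

# One row per statistic, one column per numeric field
summary = sample[['kilos', 'dollars']].agg(['min', 'max'])
print(summary)
```

Swapping `sample` for `df` should give the same kind of table for the whole data set.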


### Time-series data: Check for completeness

Each row in our data is one month's worth of shipments of a particular eel product from a particular country to the U.S. That means we might want to do some time-based comparisons, so we need to check that we're dealing with complete years.

So first let's think about what we want to see: For each year that's present in our data, we want a unique list of months for those records. If we were in Excel, we might pivot to group the data by `year` and then throw `month` in the "columns" section to see what months are present for each year.

Here, we're going to do something similar:
- Select just the columns we're interested in
- Use the pandas `groupby()` method to group the records by year ([see this notebook for reference](../reference/Grouping%20data%20in%20pandas.ipynb))
- For each set of grouped data, use the pandas `unique()` method on the month column to see what months are present

When we call `groupby()` on a data frame, it returns a collection of items we can loop over; each item is a Python tuple with two elements: the _grouping_ value (year, in this case) and a data frame of the records that belong to that group (all records where `year` equals that value).

So we can use a _for loop_ to iterate over the results and check each year.

👉 For more details on _for loops_, [see this notebook](../appendix/Python%20data%20types%20and%20basic%20syntax.ipynb).
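
To see the two-element structure in isolation, here is a minimal sketch on a tiny made-up data frame. The name `toy` and its values are illustrative assumptions, not part of the eel data. Because each item is a pair, we can unpack it into two variables right in the `for` statement.

```python
import pandas as pd

# Made-up records for illustration only
toy = pd.DataFrame({
    'year': [2010, 2010, 2011],
    'month': [1, 2, 1],
})

# Each item yielded by groupby() is a (grouping value, data frame) pair,
# so it can be unpacked directly into two names
seen = {}
for year, group in toy.groupby('year'):
    seen[year] = group['month'].tolist()
    print(year, type(group).__name__, seen[year])
```

Indexing the pair with `[0]` and `[1]` does the same job as unpacking; which one to use is a matter of taste.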

In [11]:
yearmonth = df[['year', 'month']]

for yeargroup in yearmonth.groupby('year'):
    print(yeargroup[0], yeargroup[1].month.unique())

2010 [ 1  2  3  4  5  6  7  8  9 10 11 12]
2011 [ 1  2  3  4  5  6  7  8  9 10 11 12]
2012 [ 1  2  3  4  5  6  7  8  9 10 11 12]
2013 [ 1  2  3  4  5  6  7  8  9 10 11 12]
2014 [ 1  2  3  4  5  6  7  8  9 10 11 12]
2015 [ 1  2  3  4  5  6  7  8  9 10 11 12]
2016 [ 1  2  3  4  5  6  7  8  9 10 11 12]
2017 [1 2 3 4 5]


So now we know that we have incomplete data for 2017 -- _news we can use_ as we start our analysis.
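
If we only want to flag the incomplete years rather than eyeball every list, one possible approach is to count the distinct months per year with `nunique()` and keep the years with fewer than 12. The sketch below runs on made-up data (`toy` is a hypothetical stand-in), so treat it as an illustration of the idea rather than a result.

```python
import pandas as pd

# Made-up records: 2016 has all 12 months, 2017 only the first 5
toy = pd.DataFrame({
    'year': [2016] * 12 + [2017] * 5,
    'month': list(range(1, 13)) + list(range(1, 6)),
})

# Count distinct months per year, then keep years with fewer than 12
months_per_year = toy.groupby('year')['month'].nunique()
incomplete = months_per_year[months_per_year < 12]
print(incomplete)
```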

### Come up with a list of questions

- In this data, what country ships the most eel products of any type to the U.S.?
- Same question but broken out by year.
- For each country, what was the percent change in eel shipments of all types from 2010-2016?
- What type of product is most popular?

### Q: Who ships the most eels to the U.S. (in kilos)?

We'll use our good friends `groupby()`, `sum()` and `sort_values()` to find out.

In [12]:
df[['country', 'kilos']].groupby('country') \
                        .sum() \
                        .sort_values('kilos', ascending=False)

| country | kilos |
| --- | --- |
| CHINA | 15965996 |
| VIETNAM | 637737 |
| TAIWAN | 442740 |
| JAPAN | 361364 |
| CANADA | 346075 |
| SOUTH KOREA | 243540 |
| THAILAND | 137556 |
| PORTUGAL | 41453 |
| PAKISTAN | 22453 |
| MEXICO | 20860 |
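
A slightly different way to write the same kind of query is to pick the column after grouping, which returns a Series, and then use `nlargest()` to sort and trim in one step. This is a sketch on made-up numbers; `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up shipments for illustration only
toy = pd.DataFrame({
    'country': ['CHINA', 'JAPAN', 'CHINA', 'VIETNAM'],
    'kilos': [100, 20, 50, 70],
})

# Selecting 'kilos' after grouping returns a Series of totals per country
totals = toy.groupby('country')['kilos'].sum()

# nlargest(n) sorts descending and keeps only the top n entries
top_two = totals.nlargest(2)
print(top_two)
```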


### Q: Who ships the most? (Broken out by year)

Now we want to create a table where the rows are countries, the columns are years, and the values are sums for that country, that year.

If we were doing this in an Excel pivot table, we'd just add "year" to the columns section. To do this in pandas, we're ... also going to use a pivot table. (Yes! Pandas has a [pivot table function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html).)

We are going to hand the `pivot_table()` function five arguments:
1. The data we're pivoting (`df`)
2. The name of the column whose values we're doing math on (`values='kilos'`)
3. The type of aggregation to apply to the values -- default is `mean` and we do not want that (`aggfunc='sum'`)
4. The name of the column we're grouping on (`index='country'`)
5. The name of the column whose values will become the columns of our table (`columns='year'`)

Optionally, we can then use `sort_values()` to sort the pivot table that results by our most recent year of data.
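
To see what those five arguments do before running them on the real data, here is a small sketch on a made-up data frame (`toy` and its numbers are hypothetical). Note that a country/year combination with no records comes back as `NaN`.

```python
import pandas as pd

# Made-up shipments: JAPAN has no 2017 records
toy = pd.DataFrame({
    'country': ['CHINA', 'CHINA', 'CHINA', 'JAPAN'],
    'year': [2016, 2016, 2017, 2016],
    'kilos': [10, 5, 7, 3],
})

toy_pivot = pd.pivot_table(toy,                # 1. the data
                           values='kilos',     # 2. column to do math on
                           aggfunc='sum',      # 3. aggregation (default is mean)
                           index='country',    # 4. rows
                           columns='year')     # 5. columns
print(toy_pivot)
```

The two CHINA rows for 2016 are summed into a single cell, and the missing JAPAN/2017 cell is `NaN`, which is also why the totals display as floats.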

In [13]:
pivoted_sums = pd.pivot_table(df,
                              aggfunc='sum',
                              values='kilos',
                              index='country',
                              columns='year')

pivoted_sums.sort_values(2017, ascending=False)

| country | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CHINA | 372397.0 | 249232.0 | 1437392.0 | 1090135.0 | 1753140.0 | 4713882.0 | 4578546.0 | 1771272.0 |
| TAIWAN | 73842.0 | NaN | 53774.0 | 39752.0 | 83478.0 | 48272.0 | 99535.0 | 44087.0 |
| SOUTH KOREA | 42929.0 | 41385.0 | 28146.0 | 27353.0 | 37708.0 | 8386.0 | 14729.0 | 42904.0 |
| JAPAN | 1326.0 | 2509.0 | 32255.0 | 105758.0 | 40177.0 | 69699.0 | 71748.0 | 37892.0 |
| THAILAND | 2866.0 | 5018.0 | 9488.0 | 4488.0 | 15110.0 | 41771.0 | 26931.0 | 31884.0 |
| VIETNAM | 63718.0 | 155488.0 | 118063.0 | 100828.0 | 38112.0 | 36859.0 | 96179.0 | 28490.0 |
| CANADA | 13552.0 | 24968.0 | 110796.0 | 44455.0 | 31546.0 | 28619.0 | 68568.0 | 23571.0 |
| PORTUGAL | 2081.0 | 3672.0 | 2579.0 | 2041.0 | 7215.0 | 8013.0 | 9105.0 | 6747.0 |
| PANAMA | NaN | NaN | NaN | 11849.0 | NaN | NaN | NaN | 974.0 |
| BANGLADESH | NaN | NaN | 13.0 | NaN | NaN | 600.0 | NaN | NaN |


### Q: What was the percent change in shipments for each country from 2010-2016?

For this question, we'll re-use the pivot table we just made and add a calculated column. First, though, we need to filter the table to include only records where the `2010` and `2016` values are not null, using the [`notnull()` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.notnull.html). (In this data, every country with a 2010 value also has a 2016 value, so just filtering for "2010 is not null" does the trick.)

**If you didn't know about `.notnull()` already, how would you Google to find the answer?**

In [14]:
pivoted_sums_notnull = pivoted_sums[pivoted_sums[2010].notnull()]

pivoted_sums_notnull

| country | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CANADA | 13552.0 | 24968.0 | 110796.0 | 44455.0 | 31546.0 | 28619.0 | 68568.0 | 23571.0 |
| CHINA | 372397.0 | 249232.0 | 1437392.0 | 1090135.0 | 1753140.0 | 4713882.0 | 4578546.0 | 1771272.0 |
| JAPAN | 1326.0 | 2509.0 | 32255.0 | 105758.0 | 40177.0 | 69699.0 | 71748.0 | 37892.0 |
| PORTUGAL | 2081.0 | 3672.0 | 2579.0 | 2041.0 | 7215.0 | 8013.0 | 9105.0 | 6747.0 |
| SOUTH KOREA | 42929.0 | 41385.0 | 28146.0 | 27353.0 | 37708.0 | 8386.0 | 14729.0 | 42904.0 |
| TAIWAN | 73842.0 | NaN | 53774.0 | 39752.0 | 83478.0 | 48272.0 | 99535.0 | 44087.0 |
| THAILAND | 2866.0 | 5018.0 | 9488.0 | 4488.0 | 15110.0 | 41771.0 | 26931.0 | 31884.0 |
| VIETNAM | 63718.0 | 155488.0 | 118063.0 | 100828.0 | 38112.0 | 36859.0 | 96179.0 | 28490.0 |
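
In a data set where the start year alone is not enough, we would want to require a value in both columns. Here is a sketch of two ways that could look, on a made-up pivot table (`toy_pivot` and its values are hypothetical): `dropna()` with a `subset`, or two `notnull()` masks combined with `&`.

```python
import numpy as np
import pandas as pd

# Made-up pivot table: only row 'A' has both a 2010 and a 2016 value
toy_pivot = pd.DataFrame(
    {2010: [10.0, np.nan, 5.0], 2016: [20.0, 8.0, np.nan]},
    index=['A', 'B', 'C'],
)

# Option 1: drop rows that are missing either year
both = toy_pivot.dropna(subset=[2010, 2016])

# Option 2: same result with boolean masks combined using &
both_mask = toy_pivot[toy_pivot[2010].notnull() & toy_pivot[2016].notnull()]
print(both)
```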


Now we can add a column -- `10to16pctchange`. The syntax, and the math -- new value minus old value, divided by old value -- are relatively straightforward:

`dataframe['new_column'] = (dataframe['new_value'] - dataframe['old_value']) / dataframe['old_value']`

This gives the change as a proportion, so 0.5 means a 50 percent increase. Multiply the result by 100 if you'd rather see a percentage.

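You might get a `SettingWithCopyWarning` about slices vs. copies, because `pivoted_sums_notnull` was created by filtering another data frame. The new column still gets added, so you can ignore the warning for now.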

In [15]:
pivoted_sums_notnull['10to16pctchange'] = (pivoted_sums_notnull[2016] - pivoted_sums_notnull[2010]) / pivoted_sums_notnull[2010]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
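
One common way to avoid this warning is to call `.copy()` when creating the filtered data frame, so pandas knows we mean to work on an independent table. The sketch below is an illustration, not the notebook's own method: `toy_pivot` borrows the 2010 and 2016 totals for two countries from the table above, and the column name `pct_change` is made up. It also multiplies by 100 to get a percentage rather than a proportion.

```python
import pandas as pd

# 2010 and 2016 totals for two countries, copied from the pivot table above
toy_pivot = pd.DataFrame(
    {2010: [42929.0, 1326.0], 2016: [14729.0, 71748.0]},
    index=['SOUTH KOREA', 'JAPAN'],
)

# .copy() makes an independent data frame, so adding a column
# to it should not trigger the slice-vs-copy warning
subset = toy_pivot[toy_pivot[2010].notnull()].copy()

# (new - old) / old gives a proportion; * 100 turns it into a percentage
subset['pct_change'] = (subset[2016] - subset[2010]) / subset[2010] * 100
print(subset['pct_change'].round(1))
```

South Korea's figure of roughly -65.7 percent lines up with the -0.656899 proportion in the real output.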


In [16]:
pivoted_sums_notnull.sort_values('10to16pctchange')

| country | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 10to16pctchange |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SOUTH KOREA | 42929.0 | 41385.0 | 28146.0 | 27353.0 | 37708.0 | 8386.0 | 14729.0 | 42904.0 | -0.656899 |
| TAIWAN | 73842.0 | NaN | 53774.0 | 39752.0 | 83478.0 | 48272.0 | 99535.0 | 44087.0 | 0.347946 |
| VIETNAM | 63718.0 | 155488.0 | 118063.0 | 100828.0 | 38112.0 | 36859.0 | 96179.0 | 28490.0 | 0.509448 |
| PORTUGAL | 2081.0 | 3672.0 | 2579.0 | 2041.0 | 7215.0 | 8013.0 | 9105.0 | 6747.0 | 3.3753 |
| CANADA | 13552.0 | 24968.0 | 110796.0 | 44455.0 | 31546.0 | 28619.0 | 68568.0 | 23571.0 | 4.059622 |
| THAILAND | 2866.0 | 5018.0 | 9488.0 | 4488.0 | 15110.0 | 41771.0 | 26931.0 | 31884.0 | 8.39672 |
| CHINA | 372397.0 | 249232.0 | 1437392.0 | 1090135.0 | 1753140.0 | 4713882.0 | 4578546.0 | 1771272.0 | 11.294798 |
| JAPAN | 1326.0 | 2509.0 | 32255.0 | 105758.0 | 40177.0 | 69699.0 | 71748.0 | 37892.0 | 53.108597 |


### Q: What type of product is most popular (in kilos)?

We'll use `groupby()`, `sum()` and `sort_values()` again.

In [17]:
pop_products = df[['product', 'kilos']].groupby('product') \
                                       .sum() \
                                       .sort_values('kilos', ascending=False)

In [18]:
pop_products

| product | kilos |
| --- | --- |
| EELS IN ATC NOT IN OIL | 12823744 |
| EELS IN OIL NOT >7KG | 2653729 |
| EELS FROZEN | 2265899 |
| EELS STICKS TYPE PRODUCTS NOT COOKED NOT IN OIL | 222295 |
| EELS STICKS TYPE PRODUCTS COOKED OR IN OIL | 111091 |
| EELS FRESH | 108699 |
| EELS IN OIL >7KG | 98880 |


### What other questions do you want to explore?