#### Changelog

1. Q13.3 How many clams that are wider than 0.4, have first name "Monster"?   
2. Q13.4 What was the second name of a clam captured on March 8 that was longer than 0.65 and wider than 0.57?
3. Q15.2 Create a new column `area` using diameter, length and considering that a clam is a perfect rectangle.
4. Q19.4 `density_by_age` groupby `age_cat` and compute average density of a clam.

# Assignment 1. Sea Ears.
by Anvar Kurmukov,
updated by Bogdan Kirillov

---

By the end of this task you will be able to manipulate huge tabular data:
1. Compute per-column statistics (min, max, mean, quantiles etc.);
2. Select observations/features by condition/index;
3. Create new non-linear combinations of the columns (feature engineering);
4. Perform automated data cleaning;

and more.

---

For those who are not familiar with `pandas` we recommend these (alternative) tutorials:

1. A single notebook that covers basic pandas functionality (from renaming columns to using map, apply etc.): about 30 short examples with links to videos https://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb . Highly recommended for everyone. (About 1-3 hours to go through.)

2. https://github.com/guipsamora/pandas_exercises/ 11 topics covering all essential functionality, with exercises (and solutions).

This task will be an easy ride after these tutorials.

---

We are using data on a species of clam called abalone, also known as "Sea ears". This dataset is in the public domain and can be obtained from Delve: http://www.cs.toronto.edu/~delve/data/abalone/desc.html For this task, we have modified the dataset slightly, so that you can try out some complicated data manipulation techniques while keeping the dataset as simple as possible. So, in our case, each abalone clam is a promising rapper who is ready for a debut mixtape.

You need to place "sea_ears.csv" file in the same directory as this notebook.


In [137]:
import numpy as np
import pandas as pd

# 1. Loading data

As always in Data Science, you start by making a nice cup of tea (or coffee). Your next move is to load the data:

- Start with loading `sea_ears.csv` file using `pd.read_csv()` function.
- You may also want to increase maximal displayed pandas columns: set `pd.options.display.max_columns` to 30
- Print top 10 observations in the table. `.head()`
- Print last 10 observations in the table. `.tail()`
- Print all the data columns names using method `.columns`
- Print data size (number of rows and columns). This is the `.shape` of the data.

*Almost* every python has a `head` and a `tail` just as DataFrames do.
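The loading steps listed above can be tried out without the real file. The sketch below is only an illustration: it reads a hypothetical three-row stand-in (the name `toy` and the trimmed column set are assumptions, with values copied from the first rows of the table shown in this notebook) from an in-memory string instead of `sea_ears.csv`.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for sea_ears.csv (not the real file).
csv_text = """,id,FN,SN,LN,Length
0,835,Kid,College,Machine,0.45
1,540,Jalapeno,Glam,Machine,0.5
2,2295,Baby,Full,Killer,0.52
"""

# index_col=0 tells pandas that the first, unnamed column holds the row index.
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Show up to 30 columns when a DataFrame is displayed.
pd.set_option("display.max_columns", 30)

print(toy.head(2))        # first two rows
print(toy.tail(2))        # last two rows
print(list(toy.columns))  # column names
print(toy.shape)          # (number of rows, number of columns)
```

With the real file, `io.StringIO(csv_text)` is simply replaced by the path `"sea_ears.csv"`.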

If you are using Google Colab, you can upload the file in the cell below. If you are NOT using Colab, set COLAB_P in the cell below to False.

In [138]:
COLAB_P = False
if COLAB_P:
  print("Upload your file, then read it with pd.read_csv()")
  from google.colab import files
  uploaded = files.upload()
  fn = list(uploaded.keys())[0]
  print("File is uploaded to ", fn)
else:
  print("Place your file to the same directory as the notebook, then read your file with pd.read_csv()")

Place your file to the same directory as the notebook, then read your file with pd.read_csv()


In [139]:
# Load the data
df = pd.read_csv("sea_ears.csv", index_col=0)

In [140]:
# Observe top 10 observations (int)
df.head(10)

Unnamed: 0,id,FN,SN,LN,Captured,Sex,Length,Diam,Height,Whole,Shucke,Viscera,Shell,Rings
0,835,Kid,College,Machine,20200621T000000,I,0.45,0.35,0.13,0.547,0.245,0.1405,0.1405,8.0
1,540,Jalapeno,Glam,Machine,20200308T000000,F,0.5,0.375,0.14,0.604,0.242,0.1415,0.179,15.0
2,2295,Baby,Full,Killer,20200222T000000,F,0.52,0.415,0.145,0.8045,0.3325,0.1725,0.285,10.0
3,858,Kid,Rock,Head,20200222T000000,F,0.595,0.48,0.15,1.11,0.498,0.228,0.33,10.0
4,2329,Boy,Block,Death,20201219T000000,I,0.48,0.39,0.145,0.5825,0.2315,0.121,0.255,15.0
5,2648,DJ,Block,Head,20201224T000000,M,0.5,0.38,0.12,0.5765,0.273,0.135,0.145,9.0
6,3723,Big,Full,Death,20200110T000000,I,0.47,0.355,0.12,0.4915,0.1765,0.1125,0.1325,9.0
7,251,Dungeon,Glam,Death,20200824T000000,M,0.59,0.47,0.18,1.1235,0.4205,0.2805,0.36,13.0
8,1148,MC,College,Machine,20200623T000000,M,0.58,0.45,0.145,1.0025,0.547,0.1975,0.2295,8.0
9,1949,Monster,Full,Kitty,20200202T000000,M,0.64,0.53,0.165,1.1895,0.4765,0.3,0.35,11.0


In [141]:
# Q1.1 What is the length of the abalone with id 986?
# ! Q1.2 How many rings does the fifth abalone with "FN" == "Fresh" have?
# ! Q1.3 How many rings does the sample with id 67 have?
# Q1.4 What is the `Height` of the thirty-fourth sample with `Rings` == 10?
# Q1.5 How heavy as a whole is the female abalone #6?

# Please note that for some questions there are several answers. If that's the 
# case, show all of them. Also for some, there is none. If so, show the shape 
# of returned series in your output.

# Q1.2
df[df.FN == 'Fresh'].reset_index().loc[4, 'Rings'] # Q1.2

8.0

In [142]:
# Q1.3
df[df.id == 67]['Rings']

2269    13.0
Name: Rings, dtype: float64

In [143]:
# Observe last 10 observations (int)
df.tail(10)

Unnamed: 0,id,FN,SN,LN,Captured,Sex,Length,Diam,Height,Whole,Shucke,Viscera,Shell,Rings
5690,3883,Big,Glam,Man,20200211T000000,M,0.54,0.425,0.12,0.817,0.2945,0.153,0.195,10.0
5691,3731,Monster,Glock,Machine,20200411T000000,F,0.55,0.415,0.18,1.1655,0.502,0.301,0.311,9.0
5692,3856,Jalapeno,Rock,Kitty,20200719T000000,I,0.335,0.255,0.085,0.1785,0.071,0.0405,0.055,9.0
5693,1631,Boss,Glock,Man,20200810T000000,I,0.57,0.445,0.145,0.7405,0.306,0.172,0.1825,12.0
5694,283,Jalapeno,Full,Killer,20200804T000000,M,0.485,0.395,0.14,0.6295,0.2285,0.127,0.225,14.0
5695,181,Big,Block,Head,20200512T000000,M,0.64,0.51,0.175,1.368,0.515,0.266,0.57,21.0
5696,1324,Jalapeno,Full,Head,20201215T000000,I,0.565,0.44,0.175,0.8735,0.414,0.21,0.21,11.0
5697,1242,Lil',Flow,Machine,20201206T000000,I,0.385,0.29,0.09,0.2615,0.111,0.0595,0.0745,9.0
5698,2842,Boss,Full,Killer,20200101T000000,M,0.6,0.475,0.175,1.11,0.5105,0.256,0.285,9.0
5699,2071,Baby,Glock,Machine,20200119T000000,F,0.565,0.44,0.135,0.83,0.393,0.1735,0.238,9.0


In [144]:
# Q2.1 What is the diameter of the tenth abalone with second name Flow?
# ! Q2.2 What is the weight of the shell for the 99th abalone with first name Lil'?
# ! Q2.3 How many rings does the twelfth abalone with first name MC have?
# Q2.4 How many rings does the 666th abalone have?
# Q2.5 What is the gender of the 1337th abalone?

df[df.FN == "Lil'"].reset_index().loc[98, 'Whole'] # Q2.2 

0.6509999999999999

In [145]:
# Q2.3
df[df.FN == "MC"].reset_index().loc[11, 'Rings']

9.0

In [146]:
# Increase maximal displayed columns
pd.set_option('display.max_columns', 30)

In [147]:
# Observe top 10 observations again
# Are any new columns displayed? No.
df.head(10)

Unnamed: 0,id,FN,SN,LN,Captured,Sex,Length,Diam,Height,Whole,Shucke,Viscera,Shell,Rings
0,835,Kid,College,Machine,20200621T000000,I,0.45,0.35,0.13,0.547,0.245,0.1405,0.1405,8.0
1,540,Jalapeno,Glam,Machine,20200308T000000,F,0.5,0.375,0.14,0.604,0.242,0.1415,0.179,15.0
2,2295,Baby,Full,Killer,20200222T000000,F,0.52,0.415,0.145,0.8045,0.3325,0.1725,0.285,10.0
3,858,Kid,Rock,Head,20200222T000000,F,0.595,0.48,0.15,1.11,0.498,0.228,0.33,10.0
4,2329,Boy,Block,Death,20201219T000000,I,0.48,0.39,0.145,0.5825,0.2315,0.121,0.255,15.0
5,2648,DJ,Block,Head,20201224T000000,M,0.5,0.38,0.12,0.5765,0.273,0.135,0.145,9.0
6,3723,Big,Full,Death,20200110T000000,I,0.47,0.355,0.12,0.4915,0.1765,0.1125,0.1325,9.0
7,251,Dungeon,Glam,Death,20200824T000000,M,0.59,0.47,0.18,1.1235,0.4205,0.2805,0.36,13.0
8,1148,MC,College,Machine,20200623T000000,M,0.58,0.45,0.145,1.0025,0.547,0.1975,0.2295,8.0
9,1949,Monster,Full,Kitty,20200202T000000,M,0.64,0.53,0.165,1.1895,0.4765,0.3,0.35,11.0


In [148]:
# Print all the columns/features names (int)
df.columns

Index(['id', 'FN', 'SN', 'LN', 'Captured', 'Sex', 'Length', 'Diam', 'Height',
       'Whole', 'Shucke', 'Viscera', 'Shell', 'Rings'],
      dtype='object')

In [149]:
# Q3.1 How many columns end with a vowel?
# ! Q3.2 How many columns start with a vowel? – 1
# Q3.3 Which columns are associated with the weight of the clam?
# ! Q3.4 How many columns have `th` in their names? – 1

In [150]:
# Print data size (int)

df.shape

# Q4.1 How many observations are in the data?
# Q4.2 How many features are in the data?

(5700, 14)

# 2. Basic data exploration

Let's do some basics:

- Use `.count()` to get the number of non-NaN values in every column. Are there any missing values in the data?
- Count the number of unique values in every column with `.nunique()`. What does this tell you about the features: which are most likely categorical and which are most likely numerical?
- Use pandas `.describe()` to display basic statistics about the data.
- Use pandas `.value_counts()` to count how often each unique value occurs in a specific column.
- Use pandas `.min()`, `.max()`, `.mean()`, `.std()` to display specific statistics about the data.
- Use the pandas `.dtypes` attribute to display the data types of the columns.

Hint: you could use `.sort_index()` or `.sort_values()` to sort the result of `.value_counts()`.
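As a rough illustration of the exploration methods listed above, here is a minimal sketch on a hypothetical five-row frame. The name `toy` and its values are made up for the example and are not taken from `sea_ears.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in `Rings`.
toy = pd.DataFrame({
    "Sex": ["M", "F", "I", "M", "M"],
    "Rings": [8.0, np.nan, 10.0, 8.0, 15.0],
})

non_missing = toy.count()     # number of non-NaN cells per column
missing = toy.isna().sum()    # number of NaN cells per column
n_unique = toy.nunique()      # distinct non-NaN values per column

# Frequency of every unique value, sorted by the value itself.
sex_counts = toy["Sex"].value_counts().sort_index()

# Several statistics of one column at once, rounded to 3 decimal places.
rings_stats = toy["Rings"].agg(["min", "max", "mean", "std"]).round(3)

print(non_missing, missing, n_unique, sex_counts, rings_stats, toy.dtypes, sep="\n\n")
```

A column with only a handful of unique values (here `Sex`) is most likely categorical, while a float column with many distinct values is most likely numerical.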


In [151]:
# Display number of not NaN's in every column (int)
df.count()

# Q5.1 How many NA values are in the `Whole` column?
# Q5.2 How many NA values are in the `Rings` column?
# Q5.3 How many NA values are in the `Viscera` column?
# ! Q5.4 How many NA values are in the `FN` column? – 0
# ! Q5.5 How many explicit NA values are in the `Shell` column? – 762


id          5700
FN          5700
SN          5700
LN          5700
Captured    5700
Sex         5700
Length      5545
Diam        5079
Height      4931
Whole       5687
Shucke      4334
Viscera     5158
Shell       4938
Rings       4960
dtype: int64

In [152]:
df.isna().sum()

id             0
FN             0
SN             0
LN             0
Captured       0
Sex            0
Length       155
Diam         621
Height       769
Whole         13
Shucke      1366
Viscera      542
Shell        762
Rings        740
dtype: int64

In [153]:
# Count number of unique values in every column (int)
df.nunique()

# Q6.1 How many unique values are in the `FN` column?
# Q6.2 How many unique values are in the `SN` column?
# ! Q6.3 How many unique values are in the `Rings` column? – 29
# Q6.4 How many unique values are in the `Viscera` column?
# ! Q6.5 How many unique values are in the `Shell` column? – 1687


id          5700
FN            12
SN             8
LN             7
Captured     288
Sex            3
Length      1502
Diam        1013
Height       805
Whole       3939
Shucke      1672
Viscera     1861
Shell       1687
Rings         29
dtype: int64

In [154]:
# Count frequency of the values in different columns (list of ints in ascending order)
# You could select a column using same syntax as for selecting a key from a dictionary: `data[colname]`

# Q7.1 For every unique `FN` value give its number of occurrences.
# ! Q7.2 For every unique `SN` value give its number of occurrences.
# Q7.3 For every unique `Rings` value give its number of occurrences.
# Q7.4 For every unique `Sex` value give its number of occurrences.
# ! Q7.5 For every unique `LN` value give its number of occurrences.

df.SN.value_counts()

Flow       777
Rock       713
Block      711
College    709
Breeze     709
Full       707
Glock      690
Glam       684
Name: SN, dtype: int64

In [155]:
df.LN.value_counts()

Killer     855
Machine    847
Man        841
Head       820
Death      799
Master     779
Kitty      759
Name: LN, dtype: int64

In [156]:
# Display basic data statistics using .describe()
# Note the parentheses: without them only the bound method itself is printed
df.describe()

In [157]:
# Display some column statistics (list of floats, rounded to 3 decimal places, e.g. 1.234)

# Q8.1 What are the max, min, mean and the std of the `Viscera` column?
# ! Q8.2 What are the max, min, mean and the std of the `Rings` column? – 29.000, 0.000, 9.827, 3.379
# ! Q8.3 What are the max, min, mean and the std of the `Length` column? – 0.968, 0.075, 0.548, 0.131
# Q8.4 What are the max, min, mean and the std of the `Diam` column?
# Q8.5 What are the max, min, mean and the std of the `Whole` column?

print(df.Rings.max().round(3))
print(df.Rings.min().round(3))
print(round(df.Rings.mean(), 3))
print(round(df.Rings.std(), 3))

29.0
0.0
9.827
3.379


In [158]:
print(df.Length.max().round(3))
print(df.Length.min().round(3))
print(round(df.Length.mean(), 3))
print(round(df.Length.std(), 3))

0.968
0.075
0.548
0.131


In [159]:
# Display data types of all columns (int)
df.dtypes
# Q9.1 How many columns have `object` data type?
# ! Q9.2 How many columns have `int64` data type? – 1
# Q9.3 How many columns have `float64` data type?

# Display data types of all columns (list of str)
# Q9.4 What are the columns with dtype == `float64`?
# ! Q9.5 What are the columns with dtype == `int64`? – id



id            int64
FN           object
SN           object
LN           object
Captured     object
Sex          object
Length      float64
Diam        float64
Height      float64
Whole       float64
Shucke      float64
Viscera     float64
Shell       float64
Rings       float64
dtype: object

In [160]:
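# .copy() gives an independent DataFrame, so that adding columns later
# does not trigger SettingWithCopyWarning
df = df.dropna().copy()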

# 3. Data selection

In a pandas.DataFrame you can select:

  Row/s by position (an integer in [0 .. number of rows - 1]) with .iloc, or by index label (DataFrame.index) with .loc:   

```
  data.loc[0]  
  data.loc[5:10]  
  data.iloc[0]  
  data.iloc[5:10]   
```

Though, this is probably the worst way to manipulate rows.   
  Columns by name

```
  data[columname]
```

  Row/s and columns

```
  data.loc[10, columname]  
  data.iloc[10, column_position]  # .iloc accepts integer positions only, not column names
```
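The difference between label-based `.loc` and position-based `.iloc` only becomes visible when the index has gaps, which is exactly what happens after rows are dropped. The sketch below uses a hypothetical four-row frame (`toy`, with a made-up index) to show it.

```python
import pandas as pd

# Hypothetical frame whose index has gaps, as it would after dropna().
toy = pd.DataFrame(
    {"FN": ["Kid", "Baby", "Boy", "DJ"], "Length": [0.45, 0.52, 0.48, 0.50]},
    index=[0, 2, 4, 5],
)

by_label = toy.loc[2, "Length"]   # row whose index label is 2
by_position = toy.iloc[2, 1]      # third row, second column

label_slice = toy.loc[2:4]        # label slice: both ends are included
position_slice = toy.iloc[2:4]    # position slice: the end is excluded

# Integer position of a named column, for use with .iloc.
length_pos = toy.columns.get_loc("Length")
same_cell = toy.iloc[2, length_pos]

print(by_label, by_position, same_cell)
print(list(label_slice.index), list(position_slice.index))
```

Here `toy.loc[2, ...]` and `toy.iloc[2, ...]` return different rows, so it is worth checking which of the two a question about "row N" actually refers to.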

Using boolean mask

```
  mask = data[columname] > value  
  data[mask]  
```

You could combine multiple conditions using & or | (and, or)   

```
cond1 = data[columname1] > value1  
cond2 = data[columname2] > value2  
data[cond1 & cond2]  
```

Using queries .query():  

```
value = 5 
data.query("columname > @value")  # @ refers to a Python variable
```

You could combine multiple conditions using and, or  

```
data.query("(columname1 > @value1) and (columname2 > @value2)")
```

and others. See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html for more examples.

Remember to use a different kind of quotation mark (" or ') for string values inside a query than for the query string itself.
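To see that a boolean mask and `.query()` select the same rows, here is a small sketch on a hypothetical frame. The name `toy`, the threshold `min_diam` and the values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data, not taken from sea_ears.csv.
toy = pd.DataFrame({
    "FN": ["Lil'", "Monster", "Kid", "Monster"],
    "Diam": [0.29, 0.53, 0.35, 0.415],
    "Rings": [9.0, 11.0, 8.0, 9.0],
})

min_diam = 0.4

# Boolean mask: each condition needs its own parentheses when combined with &.
mask = (toy["Diam"] > min_diam) & (toy["FN"] == "Monster")
via_mask = toy[mask]
n_matches = int(mask.sum())   # True counts as 1, so the sum is the number of matches

# The same selection with .query(); @min_diam refers to the Python variable.
via_query = toy.query('Diam > @min_diam and FN == "Monster"')

# Passing the string through a variable sidesteps quoting issues
# for names that contain an apostrophe.
name = "Lil'"
lil = toy.query("FN == @name")

print(via_mask)
print(n_matches, len(lil))
```

Counting matches with `mask.sum()` avoids building the filtered frame when only the number of rows is needed.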


In [161]:
# Select rows by position (int) 

# Q10.1 What is the first name of a clam on row 777? 
# ! Q10.2 What is the last name of a clam on row 999? – Man
# ! Q10.3 How long is a clam from row 1337? – 0.545
# Q10.4 What is the gender of a clam from row 314?
# Q10.5 When was the clam with row of 2718 captured?

# Q10.3
df.loc[1336, 'Length']

0.545

In [162]:
# Q10.2
df.loc[998, 'LN']

'Man'

In [163]:
# Select rows by index (int)

# ! Q11.1 What is the gender of a clam with index 1102? – M
df[df.id == 1102].Sex
# Q11.2 How long is a clam with index 5695?
# Q11.3 How heavy is a clam with index 1045 when still alive?
# Q11.4 When was the clam with index 252 captured?
# ! Q11.5 What is the middle name of a clam with index 38? – Breeze


1256    M
Name: Sex, dtype: object

In [164]:
df[df.id == 38].SN

3191    Breeze
Name: SN, dtype: object

In [165]:
# Using mask or .query syntax select rows/columns (int)

# Q12.1 How many clams have less than 5 rings?
# Q12.2 When were clams named "Boy Rock Killer" captured?
# ! Q12.3 How many clams have length more than 0.1? – 4179
# ! Q12.4 How many clams are heavier (in shell) than 0.3? – 1308
# Q12.5 How many clams were captured on July 24?

df[df.Length > 0.1].id.count()

4179

In [166]:
df[df.Shell > 0.3].id.count()

1308

In [167]:
# Using mask or .query syntax select rows/columns (int)

# ! Q13.1 How many clams were captured in the fall? Including both start and end day. – 1009
# Q13.2 How many clams that were captured in the fall, have first name "Lil'"?
# Q13.3 How many clams that are wider than 0.4, have first name "Monster"?
# Q13.4 What was the second name of a clam captured on March 8 that was longer than 0.65 and wider than 0.57?
# ! Q13.5 How many rings does an infant clam that was captured in June and has shucked weight between 0.54 and 0.55 have? – 9

df.Captured = pd.to_datetime(df.Captured)


In [168]:
# pd.datetime was removed in pandas 2.0; pd.Timestamp is the supported equivalent
mask1 = df.Captured >= pd.Timestamp(2020, 9, 1)
mask2 = df.Captured <= pd.Timestamp(2020, 11, 30)
df[mask1 & mask2].id.count()

1009

In [169]:
mask1 = df.Captured >= pd.Timestamp(2020, 6, 1)
mask2 = df.Captured <= pd.Timestamp(2020, 6, 30)
mask3 = df.Sex == 'I'
mask4 = df.Shucke >= 0.54
mask5 = df.Shucke <= 0.55

In [170]:
df[mask1 & mask2 & mask3 & mask4 & mask5].Rings

1281    9.0
Name: Rings, dtype: float64

In [171]:
# Using mask or .query syntax select rows/columns and compute simple statistics (float)

# Q14.1 What was the average whole weight of clams named "Kitty"?
# ! Q14.2 What was the whole weight of the heaviest Lil' clam? – 2.4925
# Q14.3 What was the weight of the lightest in terms of whole weight clam captured in June?
# ! Q14.4 What is the median length of clams captured in April? – 0.53
# Q14.5 What is the minimum diameter of clams named "Master"?

mask1 = df.FN == "Lil'"
df[mask1].Whole.max()

2.4925

In [172]:
mask1 = df.Captured >= pd.Timestamp(2020, 4, 1)
mask2 = df.Captured <= pd.Timestamp(2020, 4, 30)
df[mask1 & mask2].Length.median()

0.53

# 4. Creating new columns

Creating a new column in a pandas.DataFrame is as easy as:
```
data['new_awesome_column'] = None
```
That's it. But such a column is relatively useless. Typically, you would compute something new based on existing data and save it in a new column. For example, one might want to compute the total area of a house as the sum of all sqft_ columns, or create a boolean column indicating whether the house has grade > 2, or anything else:
```
data['total_area'] = data[col1] + data[col2] + ...
data['high_value'] = data[col] > 5
```
Pandas also provides another powerful set of tools: the .apply(), .map() and .applymap() methods (they are similar, but not identical; see https://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas ; since pandas 2.1, .applymap() is deprecated in favour of DataFrame.map()). They let you apply a function to every column (axis=0) or every row (axis=1) of a DataFrame, to every value of a Series, or to every cell of a DataFrame (element-wise). For example, the same total_area computation using .apply():
```
data['total_area'] = data[[col1, col2, col3]].apply(sum, axis=1)
```
You are not restricted to existing functions: .apply() accepts any callable, including lambda functions:
```
data['total_area'] = data[[col1, col2, col3]].apply(lambda x: x.iloc[0] + x.iloc[1] + x.iloc[2], axis=1)
```
or an ordinary Python function (useful when the logic is too complex for a lambda):
```
def _sum(x):
    total = 0
    for elem in x:
        total += elem
    return total

data['total_area'] = data[[col1, col2, col3]].apply(_sum, axis=1) 
```
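Many pandas methods have an axis parameter: axis=0 runs the operation down the rows (one result per column), while axis=1 runs it across the columns (one result per row).

Warning. Do not use Python for loops to sum the numerical elements of a container; built-in and vectorized methods such as .sum() are shorter and much faster.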
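The following sketch puts the approaches above side by side on a hypothetical two-row frame. The names `toy`, `parts_sum`, `heavy` and `size_label` are invented for the illustration, while the weight columns borrow their names from this dataset.

```python
import pandas as pd

# Hypothetical two-row frame using column names from the clam data.
toy = pd.DataFrame({
    "Whole": [0.547, 0.604],
    "Shucke": [0.245, 0.242],
    "Viscera": [0.1405, 0.1415],
    "Shell": [0.1405, 0.179],
})
parts = ["Shucke", "Viscera", "Shell"]

# Vectorized row-wise sum: the preferred way.
toy["parts_sum"] = toy[parts].sum(axis=1)

# The same result with .apply(); axis=1 passes one row at a time.
toy["parts_sum_apply"] = toy[parts].apply(lambda row: row.sum(), axis=1)

# A boolean column from a comparison.
toy["heavy"] = toy["Whole"] > 0.6

# .map() translates every value of a single column through a dict.
toy["size_label"] = toy["heavy"].map({True: "large", False: "small"})

print(toy)
```

Both sums agree; the vectorized version is usually much faster on large tables because it avoids calling a Python function once per row.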

In [173]:
# create full_name column with concatenation of all clam names
df['full_name'] = df['FN'] + ' '+ df['SN'] +' '+ df['LN']


In [174]:
df.head(5)

Unnamed: 0,id,FN,SN,LN,Captured,Sex,Length,Diam,Height,Whole,Shucke,Viscera,Shell,Rings,full_name
0,835,Kid,College,Machine,2020-06-21,I,0.45,0.35,0.13,0.547,0.245,0.1405,0.1405,8.0,Kid College Machine
1,540,Jalapeno,Glam,Machine,2020-03-08,F,0.5,0.375,0.14,0.604,0.242,0.1415,0.179,15.0,Jalapeno Glam Machine
2,2295,Baby,Full,Killer,2020-02-22,F,0.52,0.415,0.145,0.8045,0.3325,0.1725,0.285,10.0,Baby Full Killer
3,858,Kid,Rock,Head,2020-02-22,F,0.595,0.48,0.15,1.11,0.498,0.228,0.33,10.0,Kid Rock Head
4,2329,Boy,Block,Death,2020-12-19,I,0.48,0.39,0.145,0.5825,0.2315,0.121,0.255,15.0,Boy Block Death


In [175]:
# Create new columns using the old ones (new column in your DataFrame)

# ALL 
# Q15.1 Create an `age_in_years` column (age is the number of rings + 1.5) using any method above
# Q15.2 Create a new column `area` using diameter and length, considering that a clam is a perfect rectangle
# Q15.3 Create a new column `density` by dividing volume (area multiplied by a fixed number of 0.05) by whole weight
# Q15.4 Create a new column `age_cat` by splitting `age_in_years` into 5 ([1..5]) distinct intervals: 0 < x <= 20%,
# 20% < x <= 40%, ... 80% < x <= 100% percentiles. You could use `.quantile()` to compute percentiles.
# Q15.5 Create a new bool column `high_class`: it is True if the proportion of shell weight in the whole clam is greater than or equal to 1

df['age_in_years'] = df['Rings'] + 1.5
df['area']  = df['Diam'] * df['Length']
df['density'] = df['area'] * 0.05 / df['Whole']

# Compute the quantile boundaries once instead of on every call
q20, q40, q60, q80 = df['age_in_years'].quantile([0.2, 0.4, 0.6, 0.8])

def _quant(x):
    if x < q20:
        return 1
    elif x < q40:
        return 2
    elif x < q60:
        return 3
    elif x < q80:
        return 4
    return 5

df['age_cat'] = df['age_in_years'].apply(_quant)
df['high_class'] = df['Shell'] / df['Whole'] >= 1

In [176]:
# Using mask or .query syntax select rows/columns (float)

# ! Q16.1 What is the average age of the clam of the high_class(=True)? – 7.5
# ! Q16.2 What is the average area of the clam from highest age category? – 0.277
# Q16.3 What is the maximal length amongst clams with the lowest age category?
# Q16.4 What is the most frequent gender amongst clams with the lowest age category?
# Q16.5 What is the minimal number of rings in clams with high_class=True?

df[df.high_class == True].age_in_years.mean()

7.5

In [177]:
round(df[df.age_cat == 5].area.mean(), 3)

0.277

# 5. Basic date processing

You figure out that the date column (`Captured`) is too harsh for you, so you decide to convert it to a more convenient format:

- Use the pandas function to_datetime() to convert the date to a proper datetime format.
- Extract year, month, day and weekday from your new date column. Save them to separate columns.
- How many columns does your data have now?
- Drop the date column; remember to set the inplace parameter to True.

Hint: for datetime formatted date you could extract the year as follow:
```
data.date.dt.year
```
Very often a date can be a ridiculously rich feature: sometimes it is holidays that matter, sometimes weekends, sometimes special days like Black Friday.

Learn how to work with date in Python!
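As a small sketch of the steps above, the example below parses three `Captured` strings copied from the table preview in this notebook. The frame name `toy` is an assumption, and the explicit `format` string is an assumption about how these timestamps are laid out (pandas can usually infer it as well).

```python
import pandas as pd

# Three raw `Captured` values as they appear in the notebook's preview.
toy = pd.DataFrame({
    "Captured": ["20200621T000000", "20200308T000000", "20201219T000000"],
})

# An explicit format avoids guessing and makes parsing faster on large tables.
toy["Captured"] = pd.to_datetime(toy["Captured"], format="%Y%m%dT%H%M%S")

# The .dt accessor exposes the datetime components of the whole column.
toy["year"] = toy["Captured"].dt.year
toy["month"] = toy["Captured"].dt.month
toy["day"] = toy["Captured"].dt.day
toy["weekday"] = toy["Captured"].dt.weekday   # Monday is 0, Sunday is 6

# Drop the original column once its parts have been extracted.
toy.drop(columns="Captured", inplace=True)

print(toy)
```

Note that `weekday` is zero-based and starts on Monday, so a value of 6 means Sunday.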


In [178]:
# Create new columns based on `Captured` column
# ALL
# Q17.1 Convert date to datetime format
df.Captured = pd.to_datetime(df.Captured)
# Q17.2 Extract and store `year`
df['year'] = df.Captured.dt.year
# Q17.3 Extract and store `month`
df['month'] = df.Captured.dt.month
# Q17.4 Extract and store `day`
df['day'] = df.Captured.dt.day
# Q17.5 Extract and store `weekday`
df['weekday'] = df.Captured.dt.weekday
# Q17.6 Create a new column `age10` - the age of the clam in full decades (e.g. 9 year old clam - 0, 21 year old clam - 2)
df['age10'] = df['age_in_years'] // 10
from datetime import datetime


In [179]:
df.head(5)

Unnamed: 0,id,FN,SN,LN,Captured,Sex,Length,Diam,Height,Whole,Shucke,Viscera,Shell,Rings,full_name,age_in_years,area,density,age_cat,high_class,year,month,day,weekday,age10
0,835,Kid,College,Machine,2020-06-21,I,0.45,0.35,0.13,0.547,0.245,0.1405,0.1405,8.0,Kid College Machine,9.5,0.1575,0.014397,2,False,2020,6,21,6,0.0
1,540,Jalapeno,Glam,Machine,2020-03-08,F,0.5,0.375,0.14,0.604,0.242,0.1415,0.179,15.0,Jalapeno Glam Machine,16.5,0.1875,0.015522,5,False,2020,3,8,6,1.0
2,2295,Baby,Full,Killer,2020-02-22,F,0.52,0.415,0.145,0.8045,0.3325,0.1725,0.285,10.0,Baby Full Killer,11.5,0.2158,0.013412,4,False,2020,2,22,5,1.0
3,858,Kid,Rock,Head,2020-02-22,F,0.595,0.48,0.15,1.11,0.498,0.228,0.33,10.0,Kid Rock Head,11.5,0.2856,0.012865,4,False,2020,2,22,5,1.0
4,2329,Boy,Block,Death,2020-12-19,I,0.48,0.39,0.145,0.5825,0.2315,0.121,0.255,15.0,Boy Block Death,16.5,0.1872,0.016069,5,False,2020,12,19,5,1.0


In [180]:
# Drop column `Captured`

del df["Captured"]

In [181]:
df.head(5)

Unnamed: 0,id,FN,SN,LN,Sex,Length,Diam,Height,Whole,Shucke,Viscera,Shell,Rings,full_name,age_in_years,area,density,age_cat,high_class,year,month,day,weekday,age10
0,835,Kid,College,Machine,I,0.45,0.35,0.13,0.547,0.245,0.1405,0.1405,8.0,Kid College Machine,9.5,0.1575,0.014397,2,False,2020,6,21,6,0.0
1,540,Jalapeno,Glam,Machine,F,0.5,0.375,0.14,0.604,0.242,0.1415,0.179,15.0,Jalapeno Glam Machine,16.5,0.1875,0.015522,5,False,2020,3,8,6,1.0
2,2295,Baby,Full,Killer,F,0.52,0.415,0.145,0.8045,0.3325,0.1725,0.285,10.0,Baby Full Killer,11.5,0.2158,0.013412,4,False,2020,2,22,5,1.0
3,858,Kid,Rock,Head,F,0.595,0.48,0.15,1.11,0.498,0.228,0.33,10.0,Kid Rock Head,11.5,0.2856,0.012865,4,False,2020,2,22,5,1.0
4,2329,Boy,Block,Death,I,0.48,0.39,0.145,0.5825,0.2315,0.121,0.255,15.0,Boy Block Death,16.5,0.1872,0.016069,5,False,2020,12,19,5,1.0


In [182]:
# Find some date related information from the data (int)

# Q18.1 What is the most popular capturing weekday?
# ! Q18.2 What is the most popular capturing month? – 3
df['month'].value_counts()
# Q18.3 What is the least popular capturing weekday?
# Q18.4 What is the median age of the clam? (float)
# ! Q18.5 How many clams were captured on the Day of Russia (June, 12)? – 8

3     390
5     382
7     365
2     358
10    350
9     350
6     343
4     342
1     333
12    331
8     327
11    309
Name: month, dtype: int64

In [183]:
df[(df.month == 6) & (df.day == 12)].id.count()

8

# 6. Groupby

From the pandas documentation https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html :

By “group by” we are referring to a process involving one or more of the following steps:

- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.

`.groupby()` is one of the most powerful tools for feature engineering. Very often it is used to group objects that share a categorical characteristic and compute some statistic (e.g. mean, max, etc.) of one of their numerical characteristics.

Instead of computing average area of houses with high grade you could compute average areas of the houses for every grade in a single command:
```
data.groupby('grade')['sqm_tot_area'].mean()
```
You could also make multi-column groups:
```
data.groupby(['weekday','grade'])['price'].min()
```
next, you could compute multiple aggregation functions:
```
data.groupby(['weekday','grade'])['price'].agg(['min', 'max'])
```
instead of using built-in functions you could compute custom functions using apply:
```
import numpy as np
data.groupby(['condition','grade'])['bathrooms'].apply(lambda x: np.quantile(x, .5))
```
and the coolest thing now is that you can map the results of groupby back on your DataFrame!
```
gp = data.groupby(['condition'])['bathrooms'].median()
data['gp_feature'] = data['condition'].map(gp)
```
Now, if some house has condition == 2, its gp_feature will be equal to the median number of bathrooms amongst all houses with condition == 2.

Read more examples in the documentation https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
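To make the split-apply-combine idea and the "map back" trick concrete, here is a minimal sketch on a hypothetical five-row frame. The name `toy` and its values are made up; `age_cat` and `area` merely reuse column names from this assignment. The `.transform()` call at the end is an alternative that is not used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy data: two age categories with a few areas each.
toy = pd.DataFrame({
    "age_cat": [1, 1, 2, 2, 2],
    "area": [0.10, 0.20, 0.30, 0.40, 0.50],
})

# One aggregated value per group (a Series indexed by age_cat).
area_by_age = toy.groupby("age_cat")["area"].mean()

# Several aggregations at once, named by strings.
summary = toy.groupby("age_cat")["area"].agg(["min", "max"])

# Map the group statistic back onto every row of the original frame.
toy["area_by_age"] = toy["age_cat"].map(area_by_age)

# Alternative: transform() returns a result already aligned with the rows.
toy["area_by_age_t"] = toy.groupby("age_cat")["area"].transform("mean")

print(area_by_age)
print(summary)
print(toy)
```

Both new columns hold the same values: every row carries the mean area of its own age category.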


In [184]:
# Create some groupby features

# Q19.1 `whole_by_year` groupby `year` and compute median whole weight.
# Q19.2 `shell_by_weekday` groupby `weekday` and compute median shell weight.
# ! Q19.3 `area_by_age` groupby `age_cat` and compute average `area`.
# ! Q19.4 `density_by_age` groupby `age_cat` and compute average density of a clam.

area_by_age = df.groupby('age_cat')['area'].mean()
df['area_by_age'] = df['age_cat'].map(area_by_age)


In [185]:
density_by_age = df.groupby('age_cat')['density'].mean()
df['density_by_age'] = df['age_cat'].map(density_by_age)
df


Unnamed: 0,id,FN,SN,LN,Sex,Length,Diam,Height,Whole,Shucke,Viscera,Shell,Rings,full_name,age_in_years,area,density,age_cat,high_class,year,month,day,weekday,age10,area_by_age,density_by_age
0,835,Kid,College,Machine,I,0.450,0.350,0.130,0.5470,0.2450,0.1405,0.1405,8.0,Kid College Machine,9.5,0.157500,0.014397,2,False,2020,6,21,6,0.0,0.174115,0.018027
1,540,Jalapeno,Glam,Machine,F,0.500,0.375,0.140,0.6040,0.2420,0.1415,0.1790,15.0,Jalapeno Glam Machine,16.5,0.187500,0.015522,5,False,2020,3,8,6,1.0,0.277196,0.013282
2,2295,Baby,Full,Killer,F,0.520,0.415,0.145,0.8045,0.3325,0.1725,0.2850,10.0,Baby Full Killer,11.5,0.215800,0.013412,4,False,2020,2,22,5,1.0,0.274417,0.013939
3,858,Kid,Rock,Head,F,0.595,0.480,0.150,1.1100,0.4980,0.2280,0.3300,10.0,Kid Rock Head,11.5,0.285600,0.012865,4,False,2020,2,22,5,1.0,0.274417,0.013939
4,2329,Boy,Block,Death,I,0.480,0.390,0.145,0.5825,0.2315,0.1210,0.2550,15.0,Boy Block Death,16.5,0.187200,0.016069,5,False,2020,12,19,5,1.0,0.277196,0.013282
5,2648,DJ,Block,Head,M,0.500,0.380,0.120,0.5765,0.2730,0.1350,0.1450,9.0,DJ Block Head,10.5,0.190000,0.016479,3,False,2020,12,24,3,1.0,0.237983,0.015130
6,3723,Big,Full,Death,I,0.470,0.355,0.120,0.4915,0.1765,0.1125,0.1325,9.0,Big Full Death,10.5,0.166850,0.016974,3,False,2020,1,10,4,1.0,0.237983,0.015130
7,251,Dungeon,Glam,Death,M,0.590,0.470,0.180,1.1235,0.4205,0.2805,0.3600,13.0,Dungeon Glam Death,14.5,0.277300,0.012341,5,False,2020,8,24,0,1.0,0.277196,0.013282
8,1148,MC,College,Machine,M,0.580,0.450,0.145,1.0025,0.5470,0.1975,0.2295,8.0,MC College Machine,9.5,0.261000,0.013017,2,False,2020,6,23,1,0.0,0.174115,0.018027
9,1949,Monster,Full,Kitty,M,0.640,0.530,0.165,1.1895,0.4765,0.3000,0.3500,11.0,Monster Full Kitty,12.5,0.339200,0.014258,4,False,2020,2,2,6,1.0,0.274417,0.013939


In [186]:
# Create some other groupby features
# for this task check out this answer:
# https://stackoverflow.com/questions/47913343/how-to-groupby-and-map-by-two-columns-pandas-dataframe

# ! Q20.1 `rings_fn` groupby `n_rings` and compute the average number of occurrences of every unique first name

# ! Q20.2 `n_month` groupby `month` and count the number of clams captured in each month
n_month = df.groupby('month')['month'].count()
n_month

month
1     333
2     358
3     390
4     342
5     382
6     343
7     365
8     327
9     350
10    350
11    309
12    331
Name: month, dtype: int64

In [187]:
# Q20.1
rings_fn = df.groupby(['Rings', 'FN']).size()
rings_fn = rings_fn.reset_index(level=['Rings', 'FN'], name='counts')
rings_fn = rings_fn.groupby('Rings')['counts'].mean()
rings_fn

Rings
1.0      1.000000
2.0      1.000000
3.0      1.875000
4.0      4.750000
5.0      9.583333
6.0     21.583333
7.0     32.666667
8.0     47.333333
9.0     57.416667
10.0    52.833333
11.0    40.666667
12.0    22.333333
13.0    16.916667
14.0    10.500000
15.0     8.583333
16.0     5.583333
17.0     4.833333
18.0     3.818182
19.0     2.909091
20.0     2.888889
21.0     1.555556
22.0     1.200000
23.0     1.125000
24.0     1.000000
25.0     1.000000
26.0     1.000000
27.0     1.000000
29.0     1.000000
Name: counts, dtype: float64
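
The two-step groupby used for Q20.1 can be checked by hand on a tiny example. The sketch below assumes a made-up five-row frame (`toy`) with the same column names: first count each (`Rings`, `FN`) pair, then average those counts within each `Rings` value.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook's df.
toy = pd.DataFrame({
    "Rings": [8, 8, 8, 9, 9],
    "FN": ["Kid", "Kid", "MC", "DJ", "Big"],
})

# Step 1: how many times does each first name occur for each ring count?
pair_counts = toy.groupby(["Rings", "FN"]).size().reset_index(name="counts")

# Step 2: average those counts within each ring count.
avg_per_ring = pair_counts.groupby("Rings")["counts"].mean()

print(avg_per_ring)
```

For `Rings == 8` the names occur 2 and 1 times, so the average is 1.5; for `Rings == 9` both names occur once, so the average is 1.0.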

# 7. Building a regression model

- You do not need to normalize data for tree models, but for linear/kNN models this step is essential.
- Remember that not all of the features in the table are numeric; some of them might be viewed as categorical.
- You may create or drop any features you want, but you must not use features derived from age or the number of rings (e.g. the average number of rings for high-class clams).
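
One possible way to combine scaling of numeric columns with encoding of categorical ones is a scikit-learn `ColumnTransformer` inside a `Pipeline`. This is only a sketch on a made-up frame (`toy` and `target` are assumptions, with column names borrowed from the dataset); the notebook itself scales the columns manually further below.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame that mimics a few columns of the dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Length": rng.uniform(0.2, 0.8, size=60),
    "Whole": rng.uniform(0.1, 2.0, size=60),
    "Sex": rng.choice(["F", "I", "M"], size=60),
})
target = 10 * toy["Length"] + rng.normal(0, 0.1, size=60)

# Scale numeric columns, one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Length", "Whole"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])
model = Pipeline([("prep", preprocess), ("ridge", Ridge(alpha=1.0))])
model.fit(toy, target)

# Two scaled numeric columns plus one column per observed category.
n_features_out = model.named_steps["prep"].transform(toy).shape[1]
print(n_features_out)
```

Wrapping the preprocessing in a pipeline means the scaler is refit on each training fold during cross-validation, so no statistics leak from the validation fold.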



In [188]:
df.head(5)

Unnamed: 0,id,FN,SN,LN,Sex,Length,Diam,Height,Whole,Shucke,Viscera,Shell,Rings,full_name,age_in_years,area,density,age_cat,high_class,year,month,day,weekday,age10,area_by_age,density_by_age
0,835,Kid,College,Machine,I,0.45,0.35,0.13,0.547,0.245,0.1405,0.1405,8.0,Kid College Machine,9.5,0.1575,0.014397,2,False,2020,6,21,6,0.0,0.174115,0.018027
1,540,Jalapeno,Glam,Machine,F,0.5,0.375,0.14,0.604,0.242,0.1415,0.179,15.0,Jalapeno Glam Machine,16.5,0.1875,0.015522,5,False,2020,3,8,6,1.0,0.277196,0.013282
2,2295,Baby,Full,Killer,F,0.52,0.415,0.145,0.8045,0.3325,0.1725,0.285,10.0,Baby Full Killer,11.5,0.2158,0.013412,4,False,2020,2,22,5,1.0,0.274417,0.013939
3,858,Kid,Rock,Head,F,0.595,0.48,0.15,1.11,0.498,0.228,0.33,10.0,Kid Rock Head,11.5,0.2856,0.012865,4,False,2020,2,22,5,1.0,0.274417,0.013939
4,2329,Boy,Block,Death,I,0.48,0.39,0.145,0.5825,0.2315,0.121,0.255,15.0,Boy Block Death,16.5,0.1872,0.016069,5,False,2020,12,19,5,1.0,0.277196,0.013282


In [189]:
# Q21 Drop all generated features which used age or number of rings column, e.g. rings_month, age_cat.
Y = df["age_in_years"].values

good_columns = ['Length', 'Diam', 'Height', 'Whole', 'Shucke', 'Viscera', 'Shell', 'density', 'area']
cat_columns = 'Sex'

dummy_columns = pd.get_dummies(df[cat_columns])
X = pd.concat([df[good_columns], dummy_columns], axis=1)
X.head(5)

Unnamed: 0,Length,Diam,Height,Whole,Shucke,Viscera,Shell,density,area,F,I,M
0,0.45,0.35,0.13,0.547,0.245,0.1405,0.1405,0.014397,0.1575,0,1,0
1,0.5,0.375,0.14,0.604,0.242,0.1415,0.179,0.015522,0.1875,1,0,0
2,0.52,0.415,0.145,0.8045,0.3325,0.1725,0.285,0.013412,0.2158,1,0,0
3,0.595,0.48,0.15,1.11,0.498,0.228,0.33,0.012865,0.2856,1,0,0
4,0.48,0.39,0.145,0.5825,0.2315,0.121,0.255,0.016069,0.1872,0,1,0


In [190]:
# Q22 Split your data into train and test parts.
# How many records (rows) do you have in train and test tables? (list of int)?
# Use sklearn.model_selection.train_test_split with test_size=0.33 and random_state=7

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
X_train.shape

(2800, 12)

In [191]:
X_test.shape

(1380, 12)
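
The row counts above follow from how `train_test_split` rounds a fractional `test_size`: the test part gets `ceil(test_size * n_rows)` rows and the train part gets the rest. A small sketch with placeholder arrays (`dummy_X`, `dummy_y` are assumptions; only the total of 2800 + 1380 rows comes from the output above):

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 4180  # 2800 + 1380, the shapes reported above
dummy_X = np.zeros((n_rows, 1))
dummy_y = np.zeros(n_rows)

tr_X, te_X, tr_y, te_y = train_test_split(
    dummy_X, dummy_y, test_size=0.33, random_state=7
)

# The test part is rounded up, the train part takes the remainder.
expected_test = math.ceil(0.33 * n_rows)
print(len(tr_X), len(te_X), expected_test)
```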

In [192]:
# normalize data: fit the scaler on the train part only, then apply it to both parts
from sklearn.preprocessing import StandardScaler

# work on explicit copies to avoid pandas' SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()

scaler = StandardScaler()
scaler.fit(X_train[good_columns])

X_train[good_columns] = scaler.transform(X_train[good_columns])
X_test[good_columns] = scaler.transform(X_test[good_columns])

In [193]:
# Fit predictive regression models.

# ! Q23.1 Use linear regression with l2 regularization (Ridge regression)
# Q23.2 Use decision tree regression
# ! Q23.3 Use k nearest neighbours regression


In [194]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error as mse

In [195]:
# Ridge
reg = Ridge()
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
mse(y_test, y_pred)

4.89872526598317

In [196]:
from sklearn.neighbors import KNeighborsRegressor

In [197]:
# KNN
neighbors = KNeighborsRegressor()
neighbors.fit(X_train, y_train)

y_pred = neighbors.predict(X_test)
mse(y_test, y_pred)

5.743072463768116
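
Q23.2 asks for a decision tree regressor, which the cells above do not fit. A minimal sketch of what that could look like is given below; it runs on synthetic stand-in data (`features`, `target` are assumptions), and `max_depth=4` is an arbitrary illustrative choice rather than a tuned value.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; with the notebook's variables you would pass
# X_train, y_train, X_test, y_test instead.
rng = np.random.default_rng(7)
features = rng.uniform(0, 1, size=(300, 3))
target = 5 * features[:, 0] + np.sin(6 * features[:, 1]) + rng.normal(0, 0.1, size=300)

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.33, random_state=7
)

# Trees do not need scaled inputs; max_depth limits overfitting.
tree = DecisionTreeRegressor(max_depth=4, random_state=7)
tree.fit(f_train, t_train)

tree_test_mse = mean_squared_error(t_test, tree.predict(f_test))
print(tree_test_mse)
```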

In [198]:
# Use grid search to select optimal hyperparameters of your models.

# ! Q24.1 Alpha for a ridge regression
# Q24.2 Depth for the tree
# ! Q24.3 Number of neighbours for the knn
from sklearn.model_selection import GridSearchCV

In [199]:
import sklearn

In [200]:
parameters = {'alpha':np.linspace(0,10,500)}
reg = Ridge()
reg_gs = GridSearchCV(reg, parameters, cv=5, n_jobs=-1, scoring='neg_mean_squared_error')
reg_gs.fit(X_train, y_train)
print(reg_gs.best_params_)
-reg_gs.best_score_

{'alpha': 0.02004008016032064}


4.697366693105089

In [201]:
parameters = {'n_neighbors':np.arange(1,30,1)}
neighbors = KNeighborsRegressor()
neighbors_gs = GridSearchCV(neighbors, parameters, cv=5, n_jobs=-1, scoring='neg_mean_squared_error')
neighbors_gs.fit(X_train, y_train)
print(neighbors_gs.best_params_)
-neighbors_gs.best_score_

{'n_neighbors': 18}


4.984729938271605
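
The same grid-search pattern could be applied to the tree depth asked for in Q24.2. The sketch below uses synthetic data (`features`, `target` are assumptions) and a depth range of 1 to 10 chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; in the notebook this would be X_train, y_train.
rng = np.random.default_rng(7)
features = rng.uniform(0, 1, size=(300, 3))
target = 5 * features[:, 0] + rng.normal(0, 0.5, size=300)

# Candidate depths; the range is an illustrative assumption.
param_grid = {"max_depth": list(range(1, 11))}
tree_gs = GridSearchCV(
    DecisionTreeRegressor(random_state=7),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
tree_gs.fit(features, target)

print(tree_gs.best_params_)
print(-tree_gs.best_score_)  # scoring is negated MSE, so flip the sign
```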

In [202]:
# Compute train and test mean squared error for your best models (list of float).

# ! Q25.1 Train, test MSE using linear regression with l2 regularization
# Q25.2 Train, test MSE using decision tree regression
# ! Q25.3 Train, test MSE using k nearest neighbours regression

from sklearn.metrics import mean_squared_error

In [203]:
reg = Ridge(**reg_gs.best_params_)
reg.fit(X_train, y_train)

y_pred_reg = reg.predict(X_test)
y_train_pred_reg = reg.predict(X_train)

In [204]:
neighbors = KNeighborsRegressor(**neighbors_gs.best_params_)
neighbors.fit(X_train, y_train)

y_pred_neighbors = neighbors.predict(X_test)
y_train_pred_neighbors = neighbors.predict(X_train)

In [205]:
mse(y_train, y_train_pred_reg)

4.590166549068355

In [206]:
mse(y_test, y_pred_reg)

4.8937231915877835

In [207]:
mse(y_train, y_train_pred_neighbors)

4.3986287477954145

In [208]:
mse(y_test, y_pred_neighbors)

5.337084004294149

In [209]:
# Compute train and test R^2 for your best models (list of float).

# ! Q26.1 Train, test R^2 using linear regression with l2 regularization
# Q26.2 Train, test R^2 using decision tree regression
# ! Q26.3 Train, test R^2 using k nearest neighbours regression

from scipy.stats import pearsonr
from sklearn.metrics import r2_score

In [210]:
r2_score(y_train, y_train_pred_reg)

0.5605213065451612

In [211]:
r2_score(y_test, y_pred_reg)

0.5234055377470688

In [212]:
r2_score(y_train, y_train_pred_neighbors)

0.5788598094623874

In [213]:
r2_score(y_test, y_pred_neighbors)

0.4802271028737929
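
The MSE and R^2 values reported above are two views of the same quantity: R^2 = 1 - MSE / Var(y), where Var(y) is the (population) variance of the true targets in the evaluated set. A short check on synthetic numbers (`y_true` and `y_hat` are made up for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic targets and noisy predictions.
rng = np.random.default_rng(0)
y_true = rng.normal(10, 3, size=200)
y_hat = y_true + rng.normal(0, 2, size=200)

mse_value = mean_squared_error(y_true, y_hat)
r2_direct = r2_score(y_true, y_hat)

# np.var uses ddof=0 by default, which matches the definition of R^2.
r2_from_mse = 1 - mse_value / np.var(y_true)

print(r2_direct, r2_from_mse)
```

This also explains why a model can have a lower MSE on one split yet a similar R^2: the variance of the targets differs between train and test parts.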

In [214]:
#pearsonr(y_test, y_pred_reg)[0]

In [215]:
#pearsonr(y_test, y_pred_neighbors)[0]

In [216]:
# Q27 Which features have largest (by absolute value) weight in your linear model (top 5 features)? (list of str).

In [217]:
weights = np.abs(reg.coef_)

In [218]:
X_train.columns[np.argsort(weights)[::-1][:5]]

Index(['Whole', 'Shucke', 'area', 'Diam', 'Shell'], dtype='object')

In [219]:
reg.coef_[np.argsort(weights)[::-1][:5]]

array([ 4.28246254, -3.90875603, -3.72818985,  2.16875247,  1.90052695])
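
Keeping feature names and weights together in a `pandas.Series` can make this ranking easier to read than two separate `argsort` calls. A sketch on synthetic data (`toy_X`, `toy_y` and the column names `a` to `d` are assumptions); note that comparing weights by magnitude is only meaningful when the features are on a comparable scale.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Synthetic data where the true weights are 3, -2, 0.5 and 0.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
toy_y = 3 * toy_X["a"] - 2 * toy_X["b"] + 0.5 * toy_X["c"] + rng.normal(0, 0.1, size=200)

toy_reg = Ridge(alpha=1.0).fit(toy_X, toy_y)

# Label each coefficient with its feature name, then order by |weight|.
coefs = pd.Series(toy_reg.coef_, index=toy_X.columns)
top = coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(3)

print(top)
```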

# Make sure your .ipynb is linearly executable     
# Kernel -> Restart & Run All -> No ERROR cells