This notebook is a later addition to the Machine Learning notebook series in this folder. The other notebooks follow lessons from O'Reilly; this one collects the common pandas and NumPy functions that I should master when preparing for data analytics/science interviews:

1. `notnull()`
2. `rename()`
3. Indexing:
  - `set_index()`: choose an existing column to promote as the index
  - `reset_index()`: replace the index with the default 0 to n-1. The current index becomes a column, or is discarded with `drop=True`.
  - `reindex()`: conform the rows to a given sequence of index labels. If a label is not in the current index, the values in that row are, of course, NaN. So it is a different function from `reset_index()`.
4. `nunique()`: `count()` of distinct values, like SQL "SELECT COUNT(DISTINCT x)"
5. Duplicates: 
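  - `duplicated()` returns a boolean mask marking each row that repeats an earlier row (optionally only on the columns in `subset`)
  - `drop_duplicates()` drops the duplicate rows. `keep='first'` by default, so the first occurrence is kept
6. `np.select()`: for each element, assigns `choicelist[i]` for the first `condlist[i]` that is met; elements that meet no condition get `default` (0 unless specified)
7. `.str.replace(pat, repl, regex=True)`: replaces every regex match of `pat` with `repl`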
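
`nunique()` is the one item in the list that no cell below exercises, so here is a minimal sketch of how it differs from `count()`. The `toy` frame is made up for illustration and is not the planets data.

```python
import pandas as pd

# Toy data (hypothetical): one missing value and one repeated value
toy = pd.DataFrame({
    "method": ["Transit", "Transit", "Imaging", None],
    "year": [2008, 2008, 2010, 2011],
})

n_non_null = toy["method"].count()                         # non-null values
n_distinct = toy["method"].nunique()                       # distinct non-null values, like COUNT(DISTINCT method)
n_distinct_with_nan = toy["method"].nunique(dropna=False)  # missing value counted as its own category

print(n_non_null, n_distinct, n_distinct_with_nan)
```

Like SQL's `COUNT(DISTINCT x)`, the default `nunique()` ignores missing values; `dropna=False` is the switch to include them.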

In [1]:
# Imports and load dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb; sb.set()
planets = sb.load_dataset('planets')
planets.head()

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


In [2]:
# Process only data whose 'mass' is non-null
planets[planets['mass'].notnull()].head()

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


In [15]:
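# Label each row with its decade, e.g. 2006 -> "2000s"
decade = 10*(planets.year//10)
decade = decade.astype(str) + "s"
decade.name = "decade"

# Mean distance per decade for planets with a known mass, largest first.
# The two lines are equivalent; the notebook displays only the last expression.
pd.DataFrame(planets[planets['mass'].notnull()].groupby([decade])['distance'].mean().sort_values(ascending=False))
pd.DataFrame(planets[planets['mass'].notnull()].groupby([decade])['distance'].mean()).sort_values(by=['distance'], ascending=False)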

decade,distance
2010s,57.754432
2000s,50.758606
1980s,40.57
1990s,25.4844
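
The cell above groups by a Series (`decade`) that is longer than the filtered frame; pandas aligns the key to the frame's index before grouping. A small sketch of the same pattern on made-up numbers (the `toy` frame and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical); the 1995 row has no mass and is filtered out
toy = pd.DataFrame({
    "year": [1989, 1995, 2001, 2009, 2012],
    "mass": [1.0, np.nan, 2.0, 3.0, 4.0],
    "distance": [40.0, 25.0, 10.0, 30.0, 60.0],
})

# Same decade labelling as in the cell above
decade = (10 * (toy["year"] // 10)).astype(str) + "s"
decade.name = "decade"

with_mass = toy[toy["mass"].notnull()]

# The grouping key is matched to with_mass by index label, not by position
mean_distance = (
    with_mass.groupby(decade)["distance"]
    .mean()
    .sort_values(ascending=False)
    .to_frame()
)
print(mean_distance)
```

Because the 1995 row is dropped before grouping, no "1990s" group appears in the result.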


In [None]:
# rename: returns a renamed copy; inplace=False is the default, so planets itself is unchanged
planets.rename(columns={'method': 'method_name'}, inplace=False)
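
A quick check, on a made-up one-row frame, that `rename()` without `inplace=True` leaves the original untouched:

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({"method": ["Transit"], "year": [2008]})

# Keys of the mapping are old names, values are new names
renamed = toy.rename(columns={"method": "method_name"})

print(list(renamed.columns))  # new frame has the new name
print(list(toy.columns))      # original frame still has the old name
```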

In [None]:
# reset_index

# set_index

# reindex
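
The three placeholders above could be filled in along these lines. This is a sketch on a made-up frame (`toy`, with hypothetical `name` and `mass` columns), not on the planets data.

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    "name": ["b", "a", "c"],
    "mass": [2.0, 1.0, 3.0],
})

# set_index: promote the existing 'name' column to the index
by_name = toy.set_index("name")

# reset_index: the index becomes a column again and rows are numbered 0..n-1
back = by_name.reset_index()

# reset_index(drop=True): discard the old index instead of keeping it as a column
dropped = by_name.reset_index(drop=True)

# reindex: pick and order rows by label; 'z' is not in the index, so its row is NaN
reordered = by_name.reindex(["a", "b", "z"])

print(reordered)
```

`reset_index()` only changes how the rows are labelled, while `reindex()` can reorder, drop, and add rows.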

In [None]:
# duplicates
planets[planets.duplicated(subset='year', keep=False)]  # keep=False marks every row whose year is repeated as True, including the first occurrence
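
To see what the `keep` argument changes, here is a small sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Toy data (hypothetical): 2008 and 2009 each appear twice
toy = pd.DataFrame({
    "year": [2008, 2009, 2008, 2010, 2009],
    "number": [1, 2, 3, 4, 5],
})

# Default keep='first': only the later repeats are flagged
first_kept = toy.duplicated(subset="year")

# keep=False: every row that has a repeat is flagged, first occurrence included
all_marked = toy.duplicated(subset="year", keep=False)

# drop_duplicates uses the same rule and keeps the unflagged rows
deduped = toy.drop_duplicates(subset="year")

print(first_kept.tolist())
print(all_marked.tolist())
print(deduped)
```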

In [7]:
# Assign a symbol corresponding to the condition met
d = planets.distance
condlist = [d < 50, d.between(50, 70), d > 70]
choicelist = ['small', 'med', 'big']

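# Rows that meet no condition (e.g. a NaN distance) get `default`; a string default matches the string choices
new_col = np.select(condlist, choicelist, default='unknown')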

pd.concat([planets, pd.Series(new_col)], axis=1).head()

Unnamed: 0,method,number,orbital_period,mass,distance,year,0
0,Radial Velocity,1,269.3,7.1,77.4,2006,big
1,Radial Velocity,1,874.774,2.21,56.95,2008,med
2,Radial Velocity,1,763.0,2.6,19.84,2011,small
3,Radial Velocity,1,326.03,19.4,110.62,2007,big
4,Radial Velocity,1,516.22,10.5,119.47,2009,big
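
Two details of `np.select()` are easy to miss: when conditions overlap, the first one that is True wins, and rows that meet no condition fall back to `default`. A minimal sketch on made-up distances (values chosen for illustration only):

```python
import numpy as np
import pandas as pd

# Toy distances (hypothetical), including both boundaries and a missing value
d = pd.Series([10.0, 50.0, 70.0, 90.0, np.nan])

# Same binning as the cell above; between() includes both endpoints by default
condlist = [d < 50, d.between(50, 70), d > 70]
choicelist = ["small", "med", "big"]
labels = np.select(condlist, choicelist, default="unknown")

# Overlapping conditions: 90 satisfies both, but the first condition in the list wins
overlap = np.select([d > 5, d > 60], ["first", "second"], default="none")

print(labels.tolist())
print(overlap.tolist())
```

Comparisons with NaN are always False, so the missing distance meets no condition and takes the default.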


In [9]:
# String: strip every non-letter with a regex, then count the remaining characters
planets['method'].str.replace("[^a-zA-Z]", "", regex=True).str.len().head()

0    14
1    14
2    14
3    14
4    14
Name: method, dtype: int64
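
The `regex` flag decides whether the pattern is a regular expression or a literal string. A short sketch on made-up strings (the Series `s` is an assumption for illustration):

```python
import pandas as pd

# Toy strings (hypothetical)
s = pd.Series(["Radial Velocity", "Transit Timing Variations", "a.b"])

# regex=True: the pattern is a character class matching anything that is not a letter
letters_only = s.str.replace("[^a-zA-Z]", "", regex=True)

# regex=False: "." is a literal dot, not the regex wildcard
literal = s.str.replace(".", "", regex=False)

print(letters_only.tolist())
print(letters_only.str.len().tolist())
print(literal.tolist())
```

Passing `regex` explicitly avoids surprises with characters such as `.` that are special in regular expressions.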