
Minimally Sufficient Pandas - All sources referenced from Ted Petrou's GitHub repository

In [0]:
import pandas as pd


In [8]:
df = pd.read_csv('https://raw.githubusercontent.com/lalitgarg12/python-exercises/master/sample_data.csv',index_col=0)
df

name,state,color,favorite food,age,height,score,count
Jane,NY,blue,Steak,30,165,4.6,10
Niko,TX,green,Lamb,2,70,8.3,4
Aaron,FL,red,Mango,12,120,9.0,3
Penelope,AL,white,Apple,4,80,3.3,12
Dean,AK,gray,Cheese,32,180,1.8,8
Christina,TX,black,Melon,33,172,9.5,99
Cornelia,TX,red,Beans,69,150,2.2,44


**Selection with the brackets**

Placing a column name in the brackets appended to a DataFrame selects a single column of a DataFrame as a Series.

In [4]:
df['state']

Jane         NY
Niko         TX
Aaron        FL
Penelope     AL
Dean         AK
Christina    TX
Cornelia     TX
Name: state, dtype: object

**Selection with dot notation**

Alternatively, you may select a single column with dot notation. Simply place the name of the column after the dot operator. The output is exactly the same as above.

In [5]:
df.state

Jane         NY
Niko         TX
Aaron        FL
Penelope     AL
Dean         AK
Christina    TX
Cornelia     TX
Name: state, dtype: object
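
As a quick check of that claim, the sketch below rebuilds a few rows of the sample data by hand (the small `people` DataFrame is an assumption made so the snippet runs without the CSV) and confirms that both notations return equal Series.

```python
import pandas as pd

# Hypothetical stand-in for the sample data: three rows typed in by hand
people = pd.DataFrame(
    {"state": ["NY", "TX", "FL"], "age": [30, 2, 12]},
    index=pd.Index(["Jane", "Niko", "Aaron"], name="name"),
)

by_brackets = people["state"]  # bracket selection
by_dot = people.state          # dot (attribute) selection

# equals checks that the two Series hold the same labels and values
same = by_brackets.equals(by_dot)
print(type(by_dot).__name__, same)
```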

**Issues with the dot notation**

There are three issues with using dot notation. It doesn’t work in the following situations:

- When there are spaces in the column name
- When the column name is the same as a DataFrame method
- When the column name is a variable

In [6]:
df.favorite food

SyntaxError: invalid syntax

Only the brackets can select a column whose name contains spaces.

In [9]:
df['favorite food']

name
Jane          Steak
Niko           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: favorite food, dtype: object

**The column name is the same as a DataFrame method**

When a column name and a DataFrame method collide, Pandas will always reference the method and not the column. For instance, the column name count is also a DataFrame method, so dot notation references the method. This doesn’t produce an error, because Python allows you to reference methods without calling them. Let’s reference this method now.

In [10]:
df.count

<bound method DataFrame.count of           state  color favorite food  age  height  score  count
name                                                           
Jane         NY   blue         Steak   30     165    4.6     10
Niko         TX  green          Lamb    2      70    8.3      4
Aaron        FL    red         Mango   12     120    9.0      3
Penelope     AL  white         Apple    4      80    3.3     12
Dean         AK   gray        Cheese   32     180    1.8      8
Christina    TX  black         Melon   33     172    9.5     99
Cornelia     TX    red         Beans   69     150    2.2     44>

Regardless, it’s clear that using dot notation did not select a single column of the DataFrame as a Series. Again, you must use the brackets when selecting a column with the same name as a DataFrame method.

In [11]:
df['count']

name
Jane         10
Niko          4
Aaron         3
Penelope     12
Dean          8
Christina    99
Cornelia     44
Name: count, dtype: int64
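
The third situation, a column name held in a variable, is not demonstrated above. A minimal sketch, again using a small hand-built DataFrame (`people` and `col` are illustrative names, not part of the original notebook): the brackets evaluate the variable, while the dot looks for an attribute literally named `col`.

```python
import pandas as pd

# Hypothetical stand-in for the sample data
people = pd.DataFrame(
    {"state": ["NY", "TX", "FL"], "count": [10, 4, 3]},
    index=pd.Index(["Jane", "Niko", "Aaron"], name="name"),
)

col = "state"  # the column name is only known through this variable

# Brackets use the value stored in the variable
selected = people[col]

# Dot notation looks up an attribute literally called "col", which does not exist
try:
    people.col
    dot_worked = True
except AttributeError:
    dot_worked = False

# A column named like a method is shadowed by that method under dot notation,
# but the brackets still reach the column
count_is_method = callable(people.count)
count_column = people["count"]
print(selected.tolist(), dot_worked, count_is_method, count_column.tolist())
```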

**Guidance: Use the brackets for selecting a column of data**

The dot notation provides no additional functionality over the brackets and does not work in all situations. Therefore, I never use it. Its single advantage is three fewer keystrokes.

**Performance comparison iloc vs iat vs NumPy**

Let’s compare the performance of selecting a single cell with iloc, iat and a NumPy array. Here we create a NumPy array with 100k rows and 5 columns containing random data. We then create a DataFrame out of it and make the selections.

In [12]:
import numpy as np
a = np.random.rand(10 ** 5, 5)
df1 = pd.DataFrame(a)
row = 50000
col = 3
%timeit df1.iloc[row, col]

The slowest run took 35.43 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.62 µs per loop


In [13]:
%timeit df1.iat[row, col]

The slowest run took 26.59 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.06 µs per loop


In [14]:
%timeit a[row, col]

The slowest run took 70.95 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 176 ns per loop


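**Guidance:** Use **NumPy** arrays, and **not at or iat**, if your application relies on performance for selecting a single cell of data. In the timings above, the NumPy selection (176 ns) was nearly 30 times faster than iat (5.06 µs).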
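
One way to apply this guidance, sketched under the assumption that all columns share a numeric dtype: extract the underlying array once with `to_numpy()` and index it directly. The variable names mirror the cells above; the smaller array size and the fixed seed are choices made only for this sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable
a = rng.random((1000, 5))
df1 = pd.DataFrame(a)

row, col = 500, 3

# Pull the data out once, outside any hot loop, then index the array directly
values = df1.to_numpy()
cell_numpy = values[row, col]

# The same cell selected through pandas, for comparison
cell_iat = df1.iat[row, col]
cell_iloc = df1.iloc[row, col]

print(cell_numpy == cell_iat == cell_iloc)
```

The extraction itself has a cost, so this only pays off when many single-cell selections are made against the same data.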

**Method Duplication**

There are multiple methods in Pandas that do the exact same thing. Whenever two methods share the same exact underlying functionality, we say that they are aliases of each other. Having duplication in a library is completely unnecessary, pollutes the namespace and forces analysts to remember one more bit of information about a library.

This next section covers several instances of duplication along with other instances of methods that are very similar to one another.

**read_csv vs read_table duplication**

In [16]:
college = pd.read_csv('https://raw.githubusercontent.com/lalitgarg12/python-exercises/master/college.csv')
college.head()

,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


In [0]:
college2 = pd.read_table('https://raw.githubusercontent.com/lalitgarg12/python-exercises/master/college.csv', delimiter=',')

In [18]:
college.equals(college2)

True
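
The two readers differ mainly in their default delimiter: `read_csv` assumes a comma and `read_table` assumes a tab. A minimal sketch using an in-memory file (the two-row `text` sample is made up for illustration) suggests that `read_csv` alone covers both cases once `sep` is set explicitly.

```python
import io
import pandas as pd

# Made-up sample standing in for a CSV file on disk
text = (
    "instnm,city,stabbr\n"
    "Amridge University,Montgomery,AL\n"
    "Alabama State University,Montgomery,AL\n"
)

from_csv = pd.read_csv(io.StringIO(text))
from_table = pd.read_table(io.StringIO(text), sep=",")

# The same data, tab-delimited, read with read_csv
tabbed = text.replace(",", "\t")
from_csv_tabs = pd.read_csv(io.StringIO(tabbed), sep="\t")

print(from_csv.equals(from_table), from_csv.equals(from_csv_tabs))
```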

**Guidance: Only use read_csv to read in delimited text files**

**isna vs isnull and notna vs notnull**

The isna and isnull methods both determine whether each value in the DataFrame is missing or not. The result will always be a DataFrame (or Series) of all boolean values.

In [19]:
college_isna = college.isna()
college_isnull = college.isnull()
college_isna.equals(college_isnull)

True
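
The same check works for the other pair named in the heading. A small sketch with a hand-built DataFrame (`scores` is an illustrative name, and its values are made up) confirms that `notna` and `notnull` agree and that both are the element-wise negation of `isna`.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per column
scores = pd.DataFrame(
    {
        "satvrmid": [424.0, np.nan, 595.0],
        "city": ["Normal", "Montgomery", None],
    }
)

notna_result = scores.notna()
notnull_result = scores.notnull()

# The two spellings return equal boolean DataFrames
aliases_agree = notna_result.equals(notnull_result)

# notna is the element-wise opposite of isna
is_negation = notna_result.equals(~scores.isna())
print(aliases_agree, is_negation)
```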

**Guidance: Use only isna and notna**