<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#README" data-toc-modified-id="README-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>README</a></span></li><li><span><a href="#Foundations-of-Pandas" data-toc-modified-id="Foundations-of-Pandas-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Foundations of Pandas</a></span><ul class="toc-item"><li><span><a href="#'Series'" data-toc-modified-id="'Series'-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>'Series'</a></span></li><li><span><a href="#'Data-frames'" data-toc-modified-id="'Data-frames'-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>'Data frames'</a></span></li></ul></li><li><span><a href="#Essential-DataFrame-Operations" data-toc-modified-id="Essential-DataFrame-Operations-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Essential DataFrame Operations</a></span><ul class="toc-item"><li><span><a href="#I/O" data-toc-modified-id="I/O-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>I/O</a></span></li><li><span><a href="#Head/Tail" data-toc-modified-id="Head/Tail-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Head/Tail</a></span></li><li><span><a href="#Info" data-toc-modified-id="Info-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Info</a></span></li><li><span><a href="#Descriptives" data-toc-modified-id="Descriptives-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Descriptives</a></span></li></ul></li><li><span><a href="#Data-Indexing-and-Slicing" data-toc-modified-id="Data-Indexing-and-Slicing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data Indexing and Slicing</a></span><ul class="toc-item"><li><span><a href="#.loc" data-toc-modified-id=".loc-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>.loc</a></span></li><li><span><a href="#Query-Method" data-toc-modified-id="Query-Method-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Query Method</a></span></li><li><span><a href="#Set-Reset-an-Index" 
data-toc-modified-id="Set-Reset-an-Index-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Set-Reset an Index</a></span></li></ul></li></ul></div>

# README

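The data are organized in observation-by-variable format: each row is one observation and each column is one variable.

|      | var1 | var2 | ... | varn |
|------|------|------|-----|------|
| obs1 |      |      |     |      |
| obs2 |      |      |     |      |
| obsi |      |      |     |      |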

# Foundations of Pandas

In [1]:
import pandas as pd
import numpy as np

In [2]:
import wget

In [3]:
import sh

## 'Series'

In [4]:
X = np.linspace(1, 5, 5)
N = ['a', 'b', 'c', 'd', 'e']
S = pd.Series(X, N)

In [5]:
S

a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
dtype: float64

In [6]:
S["a"]

1.0

In [7]:
S.iloc[0]  # positional access; plain S[0] is deprecated when the index holds labels

1.0
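
The two lookups above read the same element, once by label and once by position. As a small sketch (it rebuilds the same `S` so the block runs on its own), `.loc` and `.iloc` make that choice explicit, and they also differ in how slices treat the end point:

```python
import numpy as np
import pandas as pd

# Same Series as above: values 1.0..5.0 labelled 'a'..'e'
S = pd.Series(np.linspace(1, 5, 5), index=['a', 'b', 'c', 'd', 'e'])

by_label = S.loc['a']          # label-based lookup
by_position = S.iloc[0]        # position-based lookup
label_slice = S.loc['b':'d']   # label slices include both end points
pos_slice = S.iloc[1:3]        # positional slices exclude the stop
```

Here `label_slice` holds three elements (`b`, `c`, `d`), while `pos_slice` holds two (`b`, `c`).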

## 'Data frames'

In [8]:
D = {"var_1": [1, 2, 3, 4], "var_2": ["s1", "s2", "s3", "s4"]}
DF = pd.DataFrame(D)

In [9]:
DF

,var_1,var_2
0,1,s1
1,2,s2
2,3,s3
3,4,s4


# Essential DataFrame Operations

## I/O

In [17]:
# Download the data
wget.download("http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Grocery_and_Gourmet_Food_5.json.gz")

'reviews_Grocery_and_Gourmet_Food_5.json.gz'

In [19]:
sh.gunzip("reviews_Grocery_and_Gourmet_Food_5.json.gz")



Resources on the JSON format:
https://www.w3schools.com/js/js_json_syntax.asp

In [20]:
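import json  # needed for json.loads below
F = "reviews_Grocery_and_Gourmet_Food_5.json"
with open(F) as fh:  # one JSON object per line
    DF = pd.json_normalize([json.loads(line) for line in fh])  # pd.io.json.json_normalize on pandas < 1.0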
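
The line-by-line parsing above can probably also be delegated to pandas itself. The sketch below assumes a file with one JSON object per line and uses `pd.read_json(..., lines=True)`. The two records are made up for illustration and are not taken from the review data:

```python
import io
import pandas as pd

# Two made-up records in a one-object-per-line layout (not real data)
text = '{"asin": "A1", "overall": 4.0}\n{"asin": "A2", "overall": 5.0}\n'

# lines=True parses each line as a separate JSON object
demo = pd.read_json(io.StringIO(text), lines=True)
```

For a file on disk, the first argument would be the file path instead of the `StringIO` buffer.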

Julian's data codebook:

- reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
- asin - ID of the product, e.g. 0000013714
- reviewerName - name of the reviewer
- helpful - helpfulness rating of the review, e.g. 2/3
- reviewText - text of the review
- overall - rating of the product
- summary - summary of the review
- unixReviewTime - time of the review (unix time)
- reviewTime - time of the review (raw)
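
The codebook lists `unixReviewTime` as unix time, that is, seconds since 1970-01-01 UTC. A minimal sketch of turning it into a proper timestamp follows. It uses two value pairs copied from the `DF.head()` output in this notebook, and the column name `reviewDate` is a hypothetical choice:

```python
import pandas as pd

# Two rows copied from the DF.head() output in this notebook
sample = pd.DataFrame({
    "reviewTime": ["06 1, 2013", "05 19, 2014"],
    "unixReviewTime": [1370044800, 1400457600],
})

# unit="s" tells pandas the integers are seconds since 1970-01-01 UTC
sample["reviewDate"] = pd.to_datetime(sample["unixReviewTime"], unit="s")
```

The resulting dates (2013-06-01 and 2014-05-19) agree with the raw `reviewTime` strings.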

In [22]:
# writing data frame to file
DF.to_csv("reviews_Grocery_and_Gourmet_Food_5.csv", index=False)

In [23]:
# reading data from a file 
DF = pd.read_csv("reviews_Grocery_and_Gourmet_Food_5.csv")

## Head/Tail

In [27]:
DF.head()

,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,616719923X,"[0, 0]",4.0,Just another flavor of Kit Kat but the taste i...,"06 1, 2013",A1VEELTKS8NLZB,Amazon Customer,Good Taste,1370044800
1,616719923X,"[0, 1]",3.0,I bought this on impulse and it comes from Jap...,"05 19, 2014",A14R9XMZVJ6INB,amf0001,"3.5 stars, sadly not as wonderful as I had hoped",1400457600
2,616719923X,"[3, 4]",4.0,Really good. Great gift for any fan of green t...,"10 8, 2013",A27IQHDZFQFNGG,Caitlin,Yum!,1381190400
3,616719923X,"[0, 0]",5.0,"I had never had it before, was curious to see ...","05 20, 2013",A31QY5TASILE89,DebraDownSth,Unexpected flavor meld,1369008000
4,616719923X,"[1, 2]",4.0,I've been looking forward to trying these afte...,"05 26, 2013",A2LWK003FFMCI5,Diana X.,"Not a very strong tea flavor, but still yummy ...",1369526400


In [29]:
DF.head().T

,0,1,2,3,4
asin,616719923X,616719923X,616719923X,616719923X,616719923X
helpful,"[0, 0]","[0, 1]","[3, 4]","[0, 0]","[1, 2]"
overall,4,3,4,5,4
reviewText,Just another flavor of Kit Kat but the taste i...,I bought this on impulse and it comes from Jap...,Really good. Great gift for any fan of green t...,"I had never had it before, was curious to see ...",I've been looking forward to trying these afte...
reviewTime,"06 1, 2013","05 19, 2014","10 8, 2013","05 20, 2013","05 26, 2013"
reviewerID,A1VEELTKS8NLZB,A14R9XMZVJ6INB,A27IQHDZFQFNGG,A31QY5TASILE89,A2LWK003FFMCI5
reviewerName,Amazon Customer,amf0001,Caitlin,DebraDownSth,Diana X.
summary,Good Taste,"3.5 stars, sadly not as wonderful as I had hoped",Yum!,Unexpected flavor meld,"Not a very strong tea flavor, but still yummy ..."
unixReviewTime,1370044800,1400457600,1381190400,1369008000,1369526400


In [28]:
DF.tail()

,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
151249,B00KCJRVO2,"[0, 0]",4.0,Delicious gluten-free oatmeal: we tried both t...,"07 12, 2014",A2L6QS8SVHT9RG,"randomartco ""period film aficionado""",Delicious gluten-free oatmeal 'quick' packs!,1405123200
151250,B00KCJRVO2,"[0, 0]",4.0,With the many selections of instant oatmeal ce...,"07 6, 2014",AFJFXN42RZ3G2,"R. DelParto ""Rose2""",Convenient and Instant,1404604800
151251,B00KCJRVO2,"[1, 1]",5.0,"While I usually review CDs and DVDs, as well a...","07 1, 2014",ASEBX8TBYWQWA,"Steven I. Ramm ""Steve Ramm &#34;Anything Phon...",Compares favorably in taste and texture with o...,1404172800
151252,B00KCJRVO2,"[0, 1]",4.0,My son and I enjoyed these oatmeal packets. H...,"07 4, 2014",ANKQGTXHREOI5,Titanium Lili,Pretty good!,1404432000
151253,B00KCJRVO2,"[0, 0]",4.0,I like to eat oatmeal i the mornings. I usuall...,"07 11, 2014",A2CF66KIQ3RKX3,Vivian Deliz,I like to eat oatmeal i the mornings,1405036800


## Info

In [30]:
DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151254 entries, 0 to 151253
Data columns (total 9 columns):
asin              151254 non-null object
helpful           151254 non-null object
overall           151254 non-null float64
reviewText        151232 non-null object
reviewTime        151254 non-null object
reviewerID        151254 non-null object
reviewerName      149761 non-null object
summary           151254 non-null object
unixReviewTime    151254 non-null int64
dtypes: float64(1), int64(1), object(7)
memory usage: 10.4+ MB


## Descriptives

In [31]:
DF.describe()

,overall,unixReviewTime
count,151254.0,151254.0
mean,4.243042,1342909000.0
std,1.090003,53756340.0
min,1.0,965779200.0
25%,4.0,1315440000.0
50%,5.0,1360368000.0
75%,5.0,1383955000.0
max,5.0,1406074000.0


In [32]:
DF.describe().T

,count,mean,std,min,25%,50%,75%,max
overall,151254.0,4.243042,1.090003,1.0,4.0,5.0,5.0,5.0
unixReviewTime,151254.0,1342909000.0,53756340.0,965779200.0,1315440000.0,1360368000.0,1383955000.0,1406074000.0
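
By default `describe()` reports only the numeric columns, which is why only `overall` and `unixReviewTime` appear above. As a rough sketch on a small made-up frame (the values are illustrative, not taken from the review data), `include="all"` adds summaries for text columns and `value_counts()` tabulates a single column:

```python
import pandas as pd

# Made-up stand-in for the review data (illustrative values only)
demo = pd.DataFrame({"asin": ["A1", "A1", "A2"], "overall": [4.0, 5.0, 4.0]})

# include="all" adds count/unique/top/freq for non-numeric columns
summary = demo.describe(include="all")

# Frequency of each distinct rating
counts = demo["overall"].value_counts()
```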


# Data Indexing and Slicing

## .loc

In [None]:
# General form: DF.loc[row_indexer, column_indexer]

In [35]:
DF.loc[0:5, ['asin', 'overall']]

,asin,overall
0,616719923X,4.0
1,616719923X,3.0
2,616719923X,4.0
3,616719923X,5.0
4,616719923X,4.0
5,616719923X,4.0
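
Note that `DF.loc[0:5, ...]` returned six rows: `.loc` slices by label and includes both end points. The sketch below contrasts this with `.iloc`, which slices by position and excludes the stop. The frame is a made-up stand-in, not the review data:

```python
import pandas as pd

# Small made-up frame with a default integer index (illustrative values)
demo = pd.DataFrame({
    "asin": list("abcdefgh"),
    "overall": [4, 3, 4, 5, 4, 4, 2, 1],
})

by_label = demo.loc[0:5, ["asin", "overall"]]   # labels 0..5 inclusive
by_position = demo.iloc[0:5, [0, 1]]            # positions 0..4
```

`by_label` has six rows and `by_position` has five.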


In [36]:
DF1 = DF.loc[0:5, ['asin', 'overall']].copy()  # explicit copy, so later assignments cannot trigger SettingWithCopyWarning

In [37]:
DF1

,asin,overall
0,616719923X,4.0
1,616719923X,3.0
2,616719923X,4.0
3,616719923X,5.0
4,616719923X,4.0
5,616719923X,4.0


In [38]:
DF1.loc[:, 'st_overall'] = DF1['overall']/5

In [39]:
DF1

,asin,overall,st_overall
0,616719923X,4.0,0.8
1,616719923X,3.0,0.6
2,616719923X,4.0,0.8
3,616719923X,5.0,1.0
4,616719923X,4.0,0.8
5,616719923X,4.0,0.8
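
When a frame is cut out of another one with `.loc` and then modified, pandas versions without copy-on-write may emit a `SettingWithCopyWarning`, because it is unclear whether the parent frame should change too. A minimal sketch of the usual remedy, an explicit `.copy()`, on a made-up stand-in for the review data (the `extra` column is hypothetical):

```python
import pandas as pd

# Made-up stand-in for the review data (illustrative values only)
DF = pd.DataFrame({
    "asin": ["x"] * 6,
    "overall": [4.0, 3.0, 4.0, 5.0, 4.0, 4.0],
    "extra": range(6),
})

# .copy() makes DF1 an independent frame, so adding a column cannot touch DF
DF1 = DF.loc[0:5, ["asin", "overall"]].copy()
DF1["st_overall"] = DF1["overall"] / 5
```

After this, `DF1` has the new `st_overall` column and `DF` is unchanged.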


In [40]:
DF1.loc[DF1['overall'] == 4, 'weird'] = DF1['overall']/5

In [41]:
DF1

,asin,overall,st_overall,weird
0,616719923X,4.0,0.8,0.8
1,616719923X,3.0,0.6,
2,616719923X,4.0,0.8,0.8
3,616719923X,5.0,1.0,
4,616719923X,4.0,0.8,0.8
5,616719923X,4.0,0.8,0.8


In [42]:
DF1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
asin          6 non-null object
overall       6 non-null float64
st_overall    6 non-null float64
weird         4 non-null float64
dtypes: float64(3), object(1)
memory usage: 272.0+ bytes
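
The `weird` column has only 4 non-null values because the assignment wrote to the masked rows only. The right-hand side is aligned on the index, and rows outside the mask stay `NaN`. A small sketch of the same mechanism, with `Series.where` as one possible equivalent (the values mirror the six `overall` ratings shown above):

```python
import pandas as pd

# The six 'overall' ratings from the DF1 output above
DF1 = pd.DataFrame({"overall": [4.0, 3.0, 4.0, 5.0, 4.0, 4.0]})

mask = DF1["overall"] == 4
# The right-hand side is aligned on the index; only masked rows are written
DF1.loc[mask, "weird"] = DF1["overall"] / 5

# Alternative: keep values where mask is True, NaN elsewhere
alt = (DF1["overall"] / 5).where(mask)
```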


## Query Method

In [43]:
DF1.query("overall == 4")

,asin,overall,st_overall,weird
0,616719923X,4.0,0.8,0.8
2,616719923X,4.0,0.8,0.8
4,616719923X,4.0,0.8,0.8
5,616719923X,4.0,0.8,0.8
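
`query` is a string-based way of writing a boolean filter. As a sketch, the block below shows the boolean-mask equivalent and how a local Python variable can be referenced with `@`. The name `target` is a hypothetical example, and the frame is a made-up stand-in:

```python
import pandas as pd

# Made-up stand-in with the same six ratings as DF1 above
DF1 = pd.DataFrame({
    "asin": ["x"] * 6,
    "overall": [4.0, 3.0, 4.0, 5.0, 4.0, 4.0],
})

target = 4  # hypothetical local variable, referenced in the query with @
via_query = DF1.query("overall == @target")
via_mask = DF1[DF1["overall"] == target]
```

Both expressions select the same rows (index 0, 2, 4 and 5).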


## Set-Reset an Index

In [44]:
D = {
    "x": ["bar", "bar", "foo", "foo"],
    "y": ["a", "b", "a", "b"],
    "z": [1, 2, 3, 4]
}
DF = pd.DataFrame(D)

In [45]:
DF

,x,y,z
0,bar,a,1
1,bar,b,2
2,foo,a,3
3,foo,b,4


In [46]:
DF.set_index('x')

,y,z
x,,
bar,a,1
bar,b,2
foo,a,3
foo,b,4


In [47]:
DF.set_index("x", inplace=True)

In [48]:
DF

,y,z
x,,
bar,a,1
bar,b,2
foo,a,3
foo,b,4


In [49]:
DF.reset_index(inplace=True)

In [50]:
DF

,x,y,z
0,bar,a,1
1,bar,b,2
2,foo,a,3
3,foo,b,4


In [51]:
DF.set_index(["x", "y"], inplace=True)

In [52]:
DF

,,z
x,y,
bar,a,1
bar,b,2
foo,a,3
foo,b,4
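
With `x` and `y` both in the index, rows are addressed by `(x, y)` pairs. A short sketch of how such a two-level index can be used with `.loc` and then flattened again follows. It rebuilds the same small frame so it runs on its own:

```python
import pandas as pd

# Same toy data as above, indexed by the two columns x and y
DF = pd.DataFrame({
    "x": ["bar", "bar", "foo", "foo"],
    "y": ["a", "b", "a", "b"],
    "z": [1, 2, 3, 4],
}).set_index(["x", "y"])

one_cell = DF.loc[("foo", "a"), "z"]   # full (x, y) key -> a single value
outer = DF.loc["bar"]                  # outer level only -> rows indexed by y
flat = DF.reset_index()                # index levels become columns again
```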
