# Introduction to Pandas

Pandas is a package built on top of NumPy that provides an efficient implementation of a **DataFrame**. 

DataFrames are essentially multidimensional arrays with attached row and column labels, often with heterogeneous types and/or missing data. Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.


## Learning objectives

1. Fundamental Pandas data structures: the `Series`, `DataFrame`, and `Index`.
2. Indexing 
3. Selection
4. Converting data types
5. Inspection and exploring
6. Renaming, removing, and creating columns
7. Renaming and removing rows

and more


In [2]:
import numpy as np
import pandas as pd

## Pandas data structure: Series

A Pandas Series is a **one-dimensional array** of **indexed data**. 
- Can be created from a list, an array, or a dictionary
- Combines values with **explicitly defined** indices
- Behaves like a labeled vector

In [3]:
x = pd.Series([2.3, 5.4, 3, 9])
x

0    2.3
1    5.4
2    3.0
3    9.0
dtype: float64

In [4]:
x.values

array([2.3, 5.4, 3. , 9. ])

In [5]:
x.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
# index
x[0]

2.3

In [7]:
x[1:3]

1    5.4
2    3.0
dtype: float64

In [8]:
x.dtype

dtype('float64')

In [9]:
# explicitly defined index
x = pd.Series([2.3, 5.4, 3, 9], index=["a", "b", "c", "d"])
x

a    2.3
b    5.4
c    3.0
d    9.0
dtype: float64

In [10]:
x["b"]

5.4

In [11]:
x.iloc[1]  # positional access; plain x[1] on a string-labeled Series is deprecated

5.4

#### Series as specialized dictionary

In [12]:
population_dict = {'California': 39538223, 
                   'Texas': 29145505,
                   'Florida': 21538187, 
                   'New York': 20201249,
                   'Pennsylvania': 13002700}
pop = pd.Series(population_dict)
pop

California      39538223
Texas           29145505
Florida         21538187
New York        20201249
Pennsylvania    13002700
dtype: int64

In [13]:
pop["California"]

39538223

In [14]:
pop["California":"Florida"]

California    39538223
Texas         29145505
Florida       21538187
dtype: int64

In [15]:
x = pd.Series(["Mon", "Tue", "Wed", "Thu", "Fri"])
x

0    Mon
1    Tue
2    Wed
3    Thu
4    Fri
dtype: object

In [16]:
x.dtype

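dtype('O')

`'O'` (the letter O, not the digit zero) stands for `object`, the generic dtype that pandas uses for strings and other arbitrary Python objects.

Change the data type with `.astype()`.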

For example, `x.astype(int)`, `x.astype(str)`, `x.astype(float)`, `x.astype("category")`
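
As a minimal sketch of these conversions, the example below uses a hypothetical Series of numeric strings named `s` (it is not part of the original notebook). Note that calling `x.astype(int)` on the weekday Series above would raise an error, because strings such as "Mon" cannot be parsed as numbers.

```python
import pandas as pd

# Hypothetical example data: numbers stored as strings
s = pd.Series(["1", "2", "3"])

as_int = s.astype(int)            # strings -> integers
as_float = s.astype(float)        # strings -> floats
as_cat = s.astype("category")     # strings -> categorical

print(as_int.dtype, as_float.dtype, as_cat.dtype)
```

`astype` returns a new Series each time, so `s` itself keeps its original dtype.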

In [17]:
x = x.astype("category")
x

0    Mon
1    Tue
2    Wed
3    Thu
4    Fri
dtype: category
Categories (5, object): ['Fri', 'Mon', 'Thu', 'Tue', 'Wed']

When you convert a `Series` to a categorical type, it can have an order defined, which allows for comparisons between categories. The `ordered` attribute tells you whether the categories are treated as ordered or not.

In [18]:
x.cat.ordered # .cat is necessary

False

In [19]:
x = x.cat.reorder_categories(['Mon', 'Tue', 'Wed', 'Thu', 'Fri'], ordered=True)
x

0    Mon
1    Tue
2    Wed
3    Thu
4    Fri
dtype: category
Categories (5, object): ['Mon' < 'Tue' < 'Wed' < 'Thu' < 'Fri']

In [20]:
x.cat.ordered

True
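
Once the categories are ordered, comparisons follow the category order instead of alphabetical order. The sketch below rebuilds the weekday Series under the hypothetical name `days` so that it runs on its own.

```python
import pandas as pd

days = pd.Series(["Mon", "Tue", "Wed", "Thu", "Fri"]).astype("category")
days = days.cat.reorder_categories(["Mon", "Tue", "Wed", "Thu", "Fri"], ordered=True)

# Element-wise comparison against a category uses the defined order
before_wed = days[days < "Wed"]
print(before_wed.tolist())

# min and max also respect the category order rather than alphabetical order
print(days.min(), days.max())
```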

## Pandas data structure: DataFrame

A DataFrame can be viewed as a **two-dimensional array** with **explicit row and column indices**. You can think of a DataFrame as a sequence of aligned Series objects. 

`DataFrame` is like a matrix. Columns in a DataFrame are `Series`. 

- Each column is a variable. 
- Each row is an observation. 
- Each cell stores a value. 

In [21]:
area_dict = {'California': 423967, 
             'Texas': 695662, 
             'Florida': 170312,
             'New York': 141297, 
             'Pennsylvania': 119280}
area = pd.Series(area_dict)
area

California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
dtype: int64

In [22]:
data = pd.DataFrame({"population": pop, "area": area})
data

Unnamed: 0,population,area
California,39538223,423967
Texas,29145505,695662
Florida,21538187,170312
New York,20201249,141297
Pennsylvania,13002700,119280


In [23]:
# row Index
data.index

Index(['California', 'Texas', 'Florida', 'New York', 'Pennsylvania'], dtype='object')

In [24]:
# column Index
data.columns

Index(['population', 'area'], dtype='object')

In [25]:
data["population"]

California      39538223
Texas           29145505
Florida         21538187
New York        20201249
Pennsylvania    13002700
Name: population, dtype: int64

In addition to a dictionary of Series, a DataFrame object can be created from 
- a list of dicts
- a 2D NumPy array

In [26]:
data = pd.DataFrame([{"a": 1, "b": 2}, {"b": 3, "c": 4}])
data # NaN marks a key that is missing from that row's dictionary

Unnamed: 0,a,b,c
0,1.0,2,
1,,3,4.0


In [27]:
data = pd.DataFrame(np.random.random(10).reshape(5,2), columns=['feature1', 'feature2'])
data

Unnamed: 0,feature1,feature2
0,0.9808,0.486555
1,0.505539,0.369621
2,0.265156,0.206925
3,0.256667,0.021376
4,0.009362,0.175955


The most common way to create a DataFrame is to read it from a file. 

In [28]:
df = pd.read_csv("iris.csv")
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


## Pandas data structure: Index

`Index` can be thought of either as an **immutable array** or as an **ordered set**. 

Row and column identifiers of a DataFrame are of `Index` type. 

In [29]:
ind = pd.Index([2,3,4,5,6,8,10])
ind

Index([2, 3, 4, 5, 6, 8, 10], dtype='int64')

In [30]:
ind[2:]

Index([4, 5, 6, 8, 10], dtype='int64')

In [31]:
ind.shape

(7,)

In [32]:
ind[0] = -2

TypeError: Index does not support mutable operations

In [None]:
# set operations
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [None]:
indA.union(indB)

Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [None]:
indA.difference(indB)

Index([1, 9], dtype='int64')

In [None]:
indA.intersection(indB)

Index([3, 5, 7], dtype='int64')

In [None]:
x = pd.read_csv("iris.csv")
x.set_index("species") # use the species column as the row index (a single-level Index, not a MultiIndex)

Unnamed: 0_level_0,sepal_length,sepal_width,petal_length,petal_width
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
setosa,5.1,3.5,1.4,0.2
setosa,4.9,3.0,1.4,0.2
setosa,4.7,3.2,1.3,0.2
setosa,4.6,3.1,1.5,0.2
setosa,5.0,3.6,1.4,0.2
...,...,...,...,...
virginica,6.7,3.0,5.2,2.3
virginica,6.3,2.5,5.0,1.9
virginica,6.5,3.0,5.2,2.0
virginica,6.2,3.4,5.4,2.3


In [None]:
x = pd.read_csv("iris.csv")
x

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [None]:
x.set_index("sepal_length")

Unnamed: 0_level_0,sepal_width,petal_length,petal_width,species
sepal_length,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
...,...,...,...,...
6.7,3.0,5.2,2.3,virginica
6.3,2.5,5.0,1.9,virginica
6.5,3.0,5.2,2.0,virginica
6.2,3.4,5.4,2.3,virginica


In [None]:
x.reset_index(drop=True)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


## Indexing

In [None]:
data = pd.Series([0.1, 2.31, -1.2], index=[0, 1, 2])
data

0    0.10
1    2.31
2   -1.20
dtype: float64

In [None]:
data[1]

2.31

In [None]:
data.keys() # same as index, will return index

Index([0, 1, 2], dtype='int64')

In [None]:
list(data.items())

[(0, 0.1), (1, 2.31), (2, -1.2)]

In [None]:
data[7] = 3.141
data

0    0.100
1    2.310
2   -1.200
7    3.141
dtype: float64

In [None]:
# slicing
data[1:3]

1    2.31
2   -1.20
dtype: float64

In [None]:
# masking
data[(data>0) & (data<1)] # keeps only values strictly between 0 and 1; here that is the entry at label 0

0    0.1
dtype: float64

In [None]:
data[[1,2]]

1    2.31
2   -1.20
dtype: float64

Note: If your Series has an explicit integer index, an indexing operation will use the explicit indices, while a slicing operation will use the implicit Python-style indices. 

In [None]:
data = pd.Series([0.1, 2.31, -1.2, 3.14], index=[1,3,5,7])

In [None]:
data[7] # indexing uses the explicit label 7

3.14

In [None]:
data[2:4] # slicing uses implicit positions 2 and 3, not the labels

5   -1.20
7    3.14
dtype: float64

In [None]:
data[5:7] # positions 5 and 6 do not exist, so the result is empty

Series([], dtype: float64)

Mixing label-based indexing with position-based slicing like this is confusing. **Use `loc` and `iloc`** to make the intent explicit.

`loc` allows indexing and slicing that always references the explicit index. 

`iloc` allows indexing and slicing that always references the implicit Python-style index. 

In [None]:
data.loc[1]

0.1

In [None]:
data.loc[7]

3.14

In [None]:
data.loc[2:6]

3    2.31
5   -1.20
dtype: float64

In [None]:
data.iloc[0]

0.1

In [None]:
data.iloc[3]

3.14

In [None]:
data.iloc[1:3]

3    2.31
5   -1.20
dtype: float64

In [None]:
data.loc[2:6] == data.iloc[1:3]

3    True
5    True
dtype: bool
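
One more difference is worth a quick sketch: a `loc` slice includes its end label, while an `iloc` slice excludes its end position, as in ordinary Python slicing. The example reuses the same Series values as above.

```python
import pandas as pd

data = pd.Series([0.1, 2.31, -1.2, 3.14], index=[1, 3, 5, 7])

by_label = data.loc[3:7]       # labels 3, 5, and 7: the end label is included
by_position = data.iloc[1:3]   # positions 1 and 2: the end position is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```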

## Selection

In [None]:
df = pd.read_csv("titanic.csv")
df.head(5) # like in R!

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [None]:
# select columns
df["age"]

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64

In [None]:
df.age # attribute access fails if the column name is not a valid Python identifier or conflicts with a DataFrame method or attribute

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64

In [None]:
df.iloc[:,1]

0      3
1      1
2      3
3      1
4      3
      ..
886    2
887    1
888    3
889    1
890    3
Name: pclass, Length: 891, dtype: int64

In [None]:
df.iloc[:3, 1]

0    3
1    1
2    3
Name: pclass, dtype: int64

In [None]:
df.loc[df["age"] < 18]

In [None]:
df.loc[df["age"] < 18, ["alive", "sex", "age"]]
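
Since the outputs of the two cells above are not shown, here is a minimal self-contained sketch of the same pattern. The `passengers` frame is a hypothetical miniature stand-in for the Titanic data, not the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the Titanic data
passengers = pd.DataFrame({
    "age": [22.0, 4.0, np.nan, 16.0],
    "sex": ["male", "female", "male", "female"],
    "alive": ["no", "yes", "no", "yes"],
})

is_minor = passengers["age"] < 18    # a comparison with NaN evaluates to False
minors = passengers.loc[is_minor, ["alive", "sex", "age"]]
print(minors)
```

The first argument to `loc` selects rows with the Boolean mask, and the second selects columns by label, so rows with a missing age are left out.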

In [None]:
df.iloc[0,1] = 0

In [None]:
df.head(1)

## Converting data types

In [None]:
# understand data types
df.dtypes

In [None]:
df["pclass"].unique()

In [None]:
# Convert pclass from integer to category. 
df["pclass"] = df["pclass"].astype("category")
df["pclass"].dtype

## Inspection and exploring

In [None]:
df.shape

In [None]:
df.head(5)

In [None]:
df.tail(5)

In [None]:
df.sample(n=5)

In [None]:
df.sample(frac=0.01)

In [None]:
df.describe()

## Renaming columns

In [None]:
orig_colnames = df.columns
orig_colnames

In [None]:
df.columns = list("abcdefghijklmno")
df

In [None]:
df.columns = orig_colnames
df

## Removing columns

In [None]:
df.drop("survived", axis=1)

In [None]:
df.drop(columns=["pclass","survived", "sex", "age"])
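
Note that `drop` returns a new DataFrame and leaves the original untouched by default. A minimal sketch with a hypothetical three-column frame:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

frame.drop(columns=["a"])             # result is discarded; frame is unchanged
print(frame.columns.tolist())

trimmed = frame.drop(columns=["a"])   # keep the result by assigning it
print(trimmed.columns.tolist())
```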

## Transforming and creating columns

In [None]:
df["Fare + Age"] = df["fare"] + df["age"]
df

In [None]:
df["fare"] = np.round(df["fare"],2)
df

## Renaming rows

In [None]:
df_sub = df.sample(n=3, random_state=42)
df_sub

In [None]:
df_sub.rename({709:"a", 439:"b", 840:"c"})

In [None]:
df_sub.index=["hello", "world", "!"]
df_sub

In [None]:
df_sub.reset_index(drop=True)

## Removing rows

In [None]:
df_sub = df.sample(n=10, random_state=42)
df_sub

In [None]:
df_sub.drop([296,535], axis=0)

In [None]:
idx = df_sub.loc[df_sub["alone"]].index  # "== True" is redundant for a Boolean column
idx

In [None]:
df_sub.drop(idx, axis=0)

In [None]:
df_sub.query("age <= 30 and sex == 'female'")

## Operating

In [None]:
A = pd.DataFrame(np.random.randint(0,10,15).reshape(5,3), columns=["f1", "f2", "f3"])
A

In [None]:
B = pd.DataFrame(np.random.randint(0,10,6).reshape(2,3), columns=["f1", "f2", "f4"])
B

In [None]:
A+B

In [None]:
A - A.iloc[0]
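
Because the cells above use random values, the following sketch repeats the two operations with small fixed (hypothetical) values. Pandas aligns on both row and column labels, and any position that exists in only one operand becomes `NaN`.

```python
import pandas as pd

# Hypothetical fixed values so the result is reproducible
A = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["f1", "f2", "f3"])
B = pd.DataFrame([[10, 20, 30]], columns=["f1", "f2", "f4"])

# Only row 0 and columns f1, f2 exist in both frames; everything else is NaN
total = A + B
print(total)

# Subtracting a row (a Series) matches its index against the column labels
centered = A - A.iloc[0]
print(centered)
```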

## Missing values

Missing values are quite common in real datasets. Pandas provides useful methods for detecting, removing, and replacing null values in Pandas data structures.

- `isnull`: Generates a Boolean mask indicating missing values
- `notnull`: Opposite of isnull
- `dropna`: Returns a filtered version of the data
- `fillna`: Returns a copy of the data with missing values filled or imputed

In [36]:
df = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])
df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [37]:
df.isnull()

Unnamed: 0,0,1,2
0,False,True,False
1,False,False,False
2,True,False,False


In [38]:
df.notnull()

Unnamed: 0,0,1,2
0,True,False,True
1,True,True,True
2,False,True,True
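
A common use of these Boolean masks is counting missing values. The sketch below rebuilds the same small frame under the hypothetical name `df_missing` and sums the mask, relying on `True` counting as 1.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])

missing_per_column = df_missing.isnull().sum()        # number of NaNs in each column
rows_with_missing = df_missing.isnull().any(axis=1)   # True where a row has any NaN

print(missing_per_column)
print(rows_with_missing)
```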


In [39]:
df.dropna()  # drop every row that contains at least one NaN

Unnamed: 0,0,1,2
1,2.0,3.0,5


In [40]:
df.dropna(axis=1)  # drop every column that contains at least one NaN

Unnamed: 0,2
0,2
1,5
2,6


In [41]:
df[3] = np.nan
df

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,,4.0,6,


In [42]:
df.dropna(axis=1, how="all")  # drop only the columns in which every value is NaN

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [43]:
df.dropna(thresh=3)  # keep only rows with at least 3 non-null values

Unnamed: 0,0,1,2,3
1,2.0,3.0,5,


In [44]:
# fillna
df

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,,4.0,6,


In [45]:
# fillna with a single value
df.fillna(-100)

Unnamed: 0,0,1,2,3
0,1.0,-100.0,2,-100.0
1,2.0,3.0,5,-100.0
2,-100.0,4.0,6,-100.0


In [46]:
df.fillna(df.mean(axis=0))  # fill each column with its own mean; column 3 is all NaN, so it stays NaN

Unnamed: 0,0,1,2,3
0,1.0,3.5,2,
1,2.0,3.0,5,
2,1.5,4.0,6,


In [47]:
df.ffill()  # forward fill; fillna(method="ffill") is deprecated in recent pandas

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,2.0,4.0,6,


In [48]:
df.bfill()  # backward fill; fillna(method="bfill") is deprecated in recent pandas

Unnamed: 0,0,1,2,3
0,1.0,3.0,2,
1,2.0,3.0,5,
2,,4.0,6,


## Sorting

In [49]:
iris = pd.read_csv("iris.csv")
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [50]:
sorted_iris = iris.sort_values(by='sepal_length', ascending=True)
sorted_iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
13,4.3,3.0,1.1,0.1,setosa
42,4.4,3.2,1.3,0.2,setosa
38,4.4,3.0,1.3,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
41,4.5,2.3,1.3,0.3,setosa
...,...,...,...,...,...
122,7.7,2.8,6.7,2.0,virginica
118,7.7,2.6,6.9,2.3,virginica
117,7.7,3.8,6.7,2.2,virginica
135,7.7,3.0,6.1,2.3,virginica


In [35]:
sorted_iris = iris.sort_values(by=['sepal_length', 'petal_length'], ascending=[True, False])
sorted_iris
# ascending [True, False] means sepal_length is ascending but petal_length is descending

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
13,4.3,3.0,1.1,0.1,setosa
8,4.4,2.9,1.4,0.2,setosa
38,4.4,3.0,1.3,0.2,setosa
42,4.4,3.2,1.3,0.2,setosa
41,4.5,2.3,1.3,0.3,setosa
...,...,...,...,...,...
118,7.7,2.6,6.9,2.3,virginica
117,7.7,3.8,6.7,2.2,virginica
122,7.7,2.8,6.7,2.0,virginica
135,7.7,3.0,6.1,2.3,virginica


## MultiIndex

The `MultiIndex` represents multiple levels of indexing.

In [51]:
index = [('California', 2010), ('California', 2020),
         ('New York', 2010), ('New York', 2020),
         ('Texas', 2010), ('Texas', 2020)]
populations = [37253956, 39538223, 19378102, 20201249, 25145561, 29145505]
index = pd.MultiIndex.from_tuples(index)
pop = pd.Series(populations, index=index)
pop

California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64

In [52]:
pop["California"]

2010    37253956
2020    39538223
dtype: int64

In [53]:
pop[:,2020]

California    39538223
New York      20201249
Texas         29145505
dtype: int64

In [54]:
df_pop = pd.DataFrame({'total': pop,
                       'under18': [9284094, 8898092, 4318033, 4181528, 6879014, 7432474]})
df_pop

Unnamed: 0,Unnamed: 1,total,under18
California,2010,37253956,9284094
California,2020,39538223,8898092
New York,2010,19378102,4318033
New York,2020,20201249,4181528
Texas,2010,25145561,6879014
Texas,2020,29145505,7432474


In [55]:
df_pop.index

MultiIndex([('California', 2010),
            ('California', 2020),
            (  'New York', 2010),
            (  'New York', 2020),
            (     'Texas', 2010),
            (     'Texas', 2020)],
           )

In [56]:
df_pop.index.names=["state", "year"]

In [57]:
df_pop

Unnamed: 0_level_0,Unnamed: 1_level_0,total,under18
state,year,Unnamed: 2_level_1,Unnamed: 3_level_1
California,2010,37253956,9284094
California,2020,39538223,8898092
New York,2010,19378102,4318033
New York,2020,20201249,4181528
Texas,2010,25145561,6879014
Texas,2020,29145505,7432474


In [58]:
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['year', 'visit']) # from_product creates every combination
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type'])
X = np.random.random(24).reshape(4,6)
df = pd.DataFrame(X, index=index, columns=columns)
df

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,0.133589,0.968567,0.748928,0.976557,0.735494,0.960621
2013,2,0.742598,0.119662,0.51401,0.744266,0.53984,0.244346
2014,1,0.08248,0.36803,0.769972,0.449054,0.22368,0.410911
2014,2,0.841142,0.12087,0.867432,0.734298,0.111893,0.352569


In [59]:
df["Bob"]

Unnamed: 0_level_0,type,HR,Temp
year,visit,Unnamed: 2_level_1,Unnamed: 3_level_1
2013,1,0.133589,0.968567
2013,2,0.742598,0.119662
2014,1,0.08248,0.36803
2014,2,0.841142,0.12087


In [60]:
df.loc[[2013]]

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,0.133589,0.968567,0.748928,0.976557,0.735494,0.960621
2013,2,0.742598,0.119662,0.51401,0.744266,0.53984,0.244346


In [61]:
df.loc[(2013, 1),:]

subject  type
Bob      HR      0.133589
         Temp    0.968567
Guido    HR      0.748928
         Temp    0.976557
Sue      HR      0.735494
         Temp    0.960621
Name: (2013, 1), dtype: float64

In [63]:
df.loc[:,("Bob", "HR")]

year  visit
2013  1        0.133589
      2        0.742598
2014  1        0.082480
      2        0.841142
Name: (Bob, HR), dtype: float64

In [64]:
idx = pd.IndexSlice
df.loc[idx[:, 1], idx[:, 'HR']]

Unnamed: 0_level_0,subject,Bob,Guido,Sue
Unnamed: 0_level_1,type,HR,HR,HR
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2013,1,0.133589,0.748928,0.735494
2014,1,0.08248,0.769972,0.22368


In [65]:
df = pd.read_csv('titanic.csv')
df # comparing survival rates by categorical variables

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [66]:
df.groupby(['sex']).count() # counts the non-null values in each column, separately for each sex

Unnamed: 0_level_0,survived,pclass,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
female,314,314,261,314,314,314,312,314,314,314,97,312,314,314
male,577,577,453,577,577,577,577,577,577,577,106,577,577,577


In [73]:
df.groupby(['sex', 'pclass']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,survived,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
sex,pclass,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
female,1,94,85,94,94,94,92,94,94,94,81,92,94,94
female,2,76,74,76,76,76,76,76,76,76,10,76,76,76
female,3,144,102,144,144,144,144,144,144,144,6,144,144,144
male,1,122,101,122,122,122,122,122,122,122,94,122,122,122
male,2,108,99,108,108,108,108,108,108,108,6,108,108,108
male,3,347,253,347,347,347,347,347,347,347,6,347,347,347


In [74]:
df.groupby(['pclass', 'sex']).count() # the order of the keys sets the order of the index levels

Unnamed: 0_level_0,Unnamed: 1_level_0,survived,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
pclass,sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
1,female,94,85,94,94,94,92,94,94,94,81,92,94,94
1,male,122,101,122,122,122,122,122,122,122,94,122,122,122
2,female,76,74,76,76,76,76,76,76,76,10,76,76,76
2,male,108,99,108,108,108,108,108,108,108,6,108,108,108
3,female,144,102,144,144,144,144,144,144,144,6,144,144,144
3,male,347,253,347,347,347,347,347,347,347,6,347,347,347
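
The counts above show group sizes, not survival rates. As a sketch of how rates could be compared, the mean of a 0/1 column gives the fraction of survivors per group. The `toy` frame below is a hypothetical miniature stand-in, not the real Titanic data.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Titanic data
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "female"],
    "pclass": [1, 3, 1, 3, 3, 1],
    "survived": [1, 0, 1, 0, 0, 1],
})

# Mean of a 0/1 column is the fraction of survivors in each group
rate_by_sex = toy.groupby("sex")["survived"].mean()
rate_by_sex_class = toy.groupby(["sex", "pclass"])["survived"].mean()

print(rate_by_sex)
print(rate_by_sex_class)
```

Grouping by two keys returns a Series with a two-level `MultiIndex`, so it can be sliced with the techniques from the MultiIndex section.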


## Combining datasets

### `concat` 

In [75]:
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])

1    A
2    B
3    C
4    D
5    E
6    F
dtype: object

In [76]:
df1 = pd.DataFrame(np.random.random((5,2)), columns=["A", "B"])
df2 = pd.DataFrame(np.random.random((5,2)), columns=["A", "B"])

In [77]:
df1

Unnamed: 0,A,B
0,0.020305,0.36703
1,0.991624,0.684269
2,0.607239,0.393991
3,0.891725,0.789874
4,0.408441,0.83733


In [78]:
df2

Unnamed: 0,A,B
0,0.981341,0.497283
1,0.152851,0.481413
2,0.05248,0.449024
3,0.119308,0.709944
4,0.165702,0.863478


In [79]:
pd.concat([df1, df2])

Unnamed: 0,A,B
0,0.020305,0.36703
1,0.991624,0.684269
2,0.607239,0.393991
3,0.891725,0.789874
4,0.408441,0.83733
0,0.981341,0.497283
1,0.152851,0.481413
2,0.05248,0.449024
3,0.119308,0.709944
4,0.165702,0.863478


In [80]:
pd.concat([df1, df2], axis=1)

Unnamed: 0,A,B,A,B
0,0.020305,0.36703,0.981341,0.497283
1,0.991624,0.684269,0.152851,0.481413
2,0.607239,0.393991,0.05248,0.449024
3,0.891725,0.789874,0.119308,0.709944
4,0.408441,0.83733,0.165702,0.863478


In [None]:
# fix duplicate indices

In [81]:
pd.concat([df1, df2], verify_integrity=True) # raises because both frames share the index labels 0 to 4

ValueError: Indexes have overlapping values: Index([0, 1, 2, 3, 4], dtype='int64')

In [82]:
pd.concat([df1, df2], ignore_index=True) # creates new index for new table

Unnamed: 0,A,B
0,0.020305,0.36703
1,0.991624,0.684269
2,0.607239,0.393991
3,0.891725,0.789874
4,0.408441,0.83733
5,0.981341,0.497283
6,0.152851,0.481413
7,0.05248,0.449024
8,0.119308,0.709944
9,0.165702,0.863478


### `append`

In [84]:
df1.append(df2) # fails in pandas 2.0+, where DataFrame.append no longer exists

AttributeError: 'DataFrame' object has no attribute 'append'

In [85]:
df1.append(df2, ignore_index=True) # DataFrame.append was removed in pandas 2.0; use pd.concat([df1, df2], ignore_index=True)

AttributeError: 'DataFrame' object has no attribute 'append'
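
Since `DataFrame.append` is not available in recent pandas, as the errors above show, the same result can be obtained with `pd.concat`. A minimal sketch with two hypothetical frames, `left` and `right`:

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

# Equivalent of the removed left.append(right, ignore_index=True)
stacked = pd.concat([left, right], ignore_index=True)
print(stacked)
```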

# In-class activity: Randomly divide the class into 22 groups for the final project, with 4-5 students per group.

# Final project groups

In [110]:
df = pd.read_csv('students_list.csv')
df


Unnamed: 0,First Name,Last Name,Email
0,Abhimanyu,Agashe,manyu@unc.edu
1,Abhishri,Agrawal,abhishri@unc.edu
2,Adam,Zawati,zawati@unc.edu
3,Adil,Syed,adilsyed@unc.edu
4,Aditi,Patil,aditipat@unc.edu
...,...,...,...
96,Yizhe,Yang,yangy23@unc.edu
97,Zahra,Alqudaihi,zahraq58@ad.unc.edu
98,Zhaojiayi,Zhang,zzhang27@unc.edu
99,Zheyuan,Liu,zheyuan@ad.unc.edu


For example, 9 groups of 4 students and 13 groups of 5 students cover 101 students (9 × 4 + 13 × 5 = 101).

In [111]:
group_sizes = [4] * 9 + [5] * 13
group_sizes # making nine groups of 4 and thirteen groups of 5

[4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]

In [112]:
groupid = []
for id_group, group_size in enumerate(group_sizes):
    groupid = np.hstack([groupid, np.ones(group_size) * (id_group + 1)]) # enumerate starts at 0, so add 1 to make group IDs run from 1 to 22

groupid 

array([ 1.,  1.,  1.,  1.,  2.,  2.,  2.,  2.,  3.,  3.,  3.,  3.,  4.,
        4.,  4.,  4.,  5.,  5.,  5.,  5.,  6.,  6.,  6.,  6.,  7.,  7.,
        7.,  7.,  8.,  8.,  8.,  8.,  9.,  9.,  9.,  9., 10., 10., 10.,
       10., 10., 11., 11., 11., 11., 11., 12., 12., 12., 12., 12., 13.,
       13., 13., 13., 13., 14., 14., 14., 14., 14., 15., 15., 15., 15.,
       15., 16., 16., 16., 16., 16., 17., 17., 17., 17., 17., 18., 18.,
       18., 18., 18., 19., 19., 19., 19., 19., 20., 20., 20., 20., 20.,
       21., 21., 21., 21., 21., 22., 22., 22., 22., 22.])

In [113]:
np.random.shuffle(groupid)
groupid

array([11., 15.,  7.,  2., 14.,  7.,  3.,  2., 18., 11., 16., 13., 14.,
        9., 13.,  8., 10., 12.,  5., 18., 19., 20.,  6., 15.,  7.,  8.,
        3.,  9.,  9.,  5., 21., 22., 10.,  2., 22., 20., 21., 19., 18.,
       12., 11., 17.,  8., 10., 11.,  4., 17., 14.,  1.,  6., 16., 10.,
       15., 18.,  1.,  5., 22., 22., 13., 18.,  4., 20., 13.,  9.,  4.,
       20., 19.,  2., 17., 15., 12., 16., 17., 22., 21., 10.,  6.,  8.,
       16., 17.,  3.,  4., 14., 21.,  7., 12.,  6., 13.,  1., 19., 21.,
        3., 12., 16., 20., 15.,  1., 14., 19., 11.,  5.])

In [114]:
df["group"] = groupid
df

Unnamed: 0,First Name,Last Name,Email,group
0,Abhimanyu,Agashe,manyu@unc.edu,11.0
1,Abhishri,Agrawal,abhishri@unc.edu,15.0
2,Adam,Zawati,zawati@unc.edu,7.0
3,Adil,Syed,adilsyed@unc.edu,2.0
4,Aditi,Patil,aditipat@unc.edu,14.0
...,...,...,...,...
96,Yizhe,Yang,yangy23@unc.edu,1.0
97,Zahra,Alqudaihi,zahraq58@ad.unc.edu,14.0
98,Zhaojiayi,Zhang,zzhang27@unc.edu,19.0
99,Zheyuan,Liu,zheyuan@ad.unc.edu,11.0


In [115]:
# convert the float group IDs to integers
df["group"] = df["group"].astype(int)
df

Unnamed: 0,First Name,Last Name,Email,group
0,Abhimanyu,Agashe,manyu@unc.edu,11
1,Abhishri,Agrawal,abhishri@unc.edu,15
2,Adam,Zawati,zawati@unc.edu,7
3,Adil,Syed,adilsyed@unc.edu,2
4,Aditi,Patil,aditipat@unc.edu,14
...,...,...,...,...
96,Yizhe,Yang,yangy23@unc.edu,1
97,Zahra,Alqudaihi,zahraq58@ad.unc.edu,14
98,Zhaojiayi,Zhang,zzhang27@unc.edu,19
99,Zheyuan,Liu,zheyuan@ad.unc.edu,11
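
The loop with `np.hstack` works, but the same group labels can also be built in one step. The following is an alternative sketch, not the method used in class: `np.repeat` creates integer IDs directly, and a seeded generator makes the random assignment reproducible. The seed value is an arbitrary assumption.

```python
import numpy as np

group_sizes = [4] * 9 + [5] * 13    # 9 groups of 4, 13 groups of 5

# Repeat each group ID (1..22) as many times as the group has members
group_ids = np.repeat(np.arange(1, len(group_sizes) + 1), group_sizes)

# Seeded generator so the assignment can be reproduced; the seed is arbitrary
rng = np.random.default_rng(0)
shuffled_ids = rng.permutation(group_ids)

print(len(shuffled_ids), shuffled_ids.dtype)
```

Because the IDs are integers from the start, no float-to-int conversion is needed before assigning them to a column such as `df["group"]`.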
