# Pandas

- The name is short for "panel data", a term economists use for data sets that track the same subjects over several time periods.
- It provides two data structures that make data analysis easier:
    - Series
    - DataFrame
- Behind the scenes, it relies on NumPy arrays for most of its storage and computation.
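
The NumPy connection mentioned in the list above can be seen directly. The following is a minimal sketch, assuming a default integer Series; the variable names are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# A Series built from a plain Python list
s = pd.Series([1, 2, 3, 4, 5])

# to_numpy() exposes the values as a NumPy array
values = s.to_numpy()
print(type(values))   # <class 'numpy.ndarray'>

# NumPy functions can be applied to a Series directly
total = np.sum(s)
print(total)          # 15
```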

Pandas is a third-party library. To install it:

```
pip install pandas
```

# Series

Useful for representing 1-d data.

In [1]:
import pandas as pd

In [2]:
s = pd.Series([1,2,3,4,5])

In [3]:
type(s)

pandas.core.series.Series

In [4]:
s = pd.Series((1,2,3,4,5))

In [5]:
import numpy as np

In [6]:
arr = np.array([1,2,3,4,5])

In [7]:
s = pd.Series(arr)

In [8]:
s

0    1
1    2
2    3
3    4
4    5
dtype: int32

In [9]:
print(s)

0    1
1    2
2    3
3    4
4    5
dtype: int32


The main difference between a NumPy 1-d array and a pandas Series is the index: every value in a Series carries a label, and that label does not have to be its position.

In [10]:
s2 = pd.Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])

In [11]:
s2

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [12]:
s2['a']

1

In [13]:
s2[0]

1

In [14]:
s3 = pd.Series([1,2,3,4,5], index=[6,7,8,9,10])

In [15]:
s3

6     1
7     2
8     3
9     4
10    5
dtype: int64

In [16]:
s3[0]

KeyError: 0

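Square brackets on a Series can look up a value by index label and, as s2[0] shows, sometimes by position. When the index itself is made of integers, square brackets always treat the number as a label, which is why s3[0] raises a KeyError. This gets confusing quickly, and newer pandas releases deprecate the positional fallback altogether. The preferred notation is therefore explicit, as follows.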

Access by index.

In [17]:
s3.loc[6]

1

Access by position.

In [18]:
s3.iloc[0]

1
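
The two accessors also differ when slicing, which is easy to miss. The sketch below reuses the s3 Series from the cells above; the other variable names are illustrative only.

```python
import pandas as pd

s3 = pd.Series([1, 2, 3, 4, 5], index=[6, 7, 8, 9, 10])

by_label = s3.loc[6]        # label 6 -> first value
by_position = s3.iloc[0]    # position 0 -> first value
print(by_label, by_position)    # 1 1

# Slices differ: loc includes the end label, iloc excludes the end position
label_slice = s3.loc[6:8]       # labels 6, 7, 8
position_slice = s3.iloc[0:2]   # positions 0, 1
print(label_slice.tolist())     # [1, 2, 3]
print(position_slice.tolist())  # [1, 2]
```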

In [19]:
s4 = pd.Series([1,2,3,4,5], index=['a','a','b','c','d'])

In [20]:
s4

a    1
a    2
b    3
c    4
d    5
dtype: int64

In [21]:
s4['a']

a    1
a    2
dtype: int64
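
The cells above show that index labels do not have to be unique: looking up a repeated label returns every matching value. A small sketch of how one might check for this, reusing s4 from above (the other names are illustrative):

```python
import pandas as pd

s4 = pd.Series([1, 2, 3, 4, 5], index=['a', 'a', 'b', 'c', 'd'])

print(s4.index.is_unique)   # False

unique_hit = s4.loc['b']    # one match -> a single value
repeated_hit = s4.loc['a']  # two matches -> a Series

print(unique_hit)                   # 3
print(type(repeated_hit).__name__)  # Series
print(repeated_hit.tolist())        # [1, 2]
```

Because the return type depends on the data, code that expects a single value may want to check is_unique first.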

# DataFrames

A DataFrame represents 2-d data. It is a collection of Series, one per column, that all share the same index.
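
To illustrate the relationship between the two structures, here is a minimal sketch that builds a DataFrame from two Series and pulls one column back out. The column names and values are hypothetical.

```python
import pandas as pd

# Two Series that share the same index labels
age = pd.Series([25, 28, 21], index=['x', 'y', 'z'])
score = pd.Series([90, 85, 77], index=['x', 'y', 'z'])

# Hypothetical column names, chosen only for this illustration
frame = pd.DataFrame({'age': age, 'score': score})

column = frame['age']
print(type(column).__name__)             # Series
print(column.index.equals(frame.index))  # True
print(frame.shape)                       # (3, 2)
```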

There are many different ways to construct DataFrames.

## From 2-d Arrays

In [22]:
arr2d = np.random.randint(10, 99, size=(10, 5))

In [23]:
arr2d

array([[16, 86, 59, 56, 41],
       [49, 42, 28, 86, 95],
       [51, 52, 63, 12, 98],
       [64, 32, 81, 50, 68],
       [29, 76, 72, 63, 15],
       [21, 40, 65, 20, 64],
       [96, 11, 95, 19, 65],
       [52, 48, 14, 92, 93],
       [11, 97, 51, 14, 56],
       [19, 20, 46, 17, 88]])

In [24]:
df = pd.DataFrame(arr2d)

In [25]:
df

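|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 16 | 86 | 59 | 56 | 41 |
| 1 | 49 | 42 | 28 | 86 | 95 |
| 2 | 51 | 52 | 63 | 12 | 98 |
| 3 | 64 | 32 | 81 | 50 | 68 |
| 4 | 29 | 76 | 72 | 63 | 15 |
| 5 | 21 | 40 | 65 | 20 | 64 |
| 6 | 96 | 11 | 95 | 19 | 65 |
| 7 | 52 | 48 | 14 | 92 | 93 |
| 8 | 11 | 97 | 51 | 14 | 56 |
| 9 | 19 | 20 | 46 | 17 | 88 |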


In [26]:
df = pd.DataFrame(arr2d, columns=['Arjun', 'Ajitesh', 'Arpita', 'Shubham', 'Yashasvi'])

In [27]:
df

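|   | Arjun | Ajitesh | Arpita | Shubham | Yashasvi |
|---|---|---|---|---|---|
| 0 | 16 | 86 | 59 | 56 | 41 |
| 1 | 49 | 42 | 28 | 86 | 95 |
| 2 | 51 | 52 | 63 | 12 | 98 |
| 3 | 64 | 32 | 81 | 50 | 68 |
| 4 | 29 | 76 | 72 | 63 | 15 |
| 5 | 21 | 40 | 65 | 20 | 64 |
| 6 | 96 | 11 | 95 | 19 | 65 |
| 7 | 52 | 48 | 14 | 92 | 93 |
| 8 | 11 | 97 | 51 | 14 | 56 |
| 9 | 19 | 20 | 46 | 17 | 88 |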


In [28]:
df = pd.DataFrame(arr2d,
    columns=['Arjun', 'Ajitesh', 'Arpita', 'Shubham', 'Yashasvi'],
    index=['a', 'b', 'c', 'd', 'e', 'f' , 'g', 'h', 'i', 'j'])

In [29]:
df

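|   | Arjun | Ajitesh | Arpita | Shubham | Yashasvi |
|---|---|---|---|---|---|
| a | 16 | 86 | 59 | 56 | 41 |
| b | 49 | 42 | 28 | 86 | 95 |
| c | 51 | 52 | 63 | 12 | 98 |
| d | 64 | 32 | 81 | 50 | 68 |
| e | 29 | 76 | 72 | 63 | 15 |
| f | 21 | 40 | 65 | 20 | 64 |
| g | 96 | 11 | 95 | 19 | 65 |
| h | 52 | 48 | 14 | 92 | 93 |
| i | 11 | 97 | 51 | 14 | 56 |
| j | 19 | 20 | 46 | 17 | 88 |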


## From a List of Lists/Tuples

In [30]:
X = [
    ("John", 25, "Pune"),
    ("James", 28, "Jhoomritallaiya"),
    ("Jane", 21, "Lonavala")
]

In [33]:
df2 = pd.DataFrame(X, columns=["Name", "Age", "Origin"])

In [34]:
df2

|   | Name | Age | Origin |
|---|---|---|---|
| 0 | John | 25 | Pune |
| 1 | James | 28 | Jhoomritallaiya |
| 2 | Jane | 21 | Lonavala |


## From a Dictionary

In [35]:
a_dictionary = {
    "Name": ["John", "James", "Jane"],
    "Age": [25, 28, 21],
    "Origin": ["Pune", "Jhoomritallaiya", "Lonavala"]
}

In [36]:
df3 = pd.DataFrame(a_dictionary)

In [37]:
df3

|   | Name | Age | Origin |
|---|---|---|---|
| 0 | John | 25 | Pune |
| 1 | James | 28 | Jhoomritallaiya |
| 2 | Jane | 21 | Lonavala |


In [38]:
a_dictionary2 = {
    "Name": ["John", "James", "Jane"],
    "Age": [25, 28],
    "Origin": ["Pune", "Jhoomritallaiya", "Lonavala"]
}

In [39]:
df3_2 = pd.DataFrame(a_dictionary2)

ValueError: arrays must all be same length
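
The error above arises because the "Age" list has two values while the other lists have three; every column in the dictionary must have the same length. The sketch below shows one way to spot the mismatch and one possible workaround (wrapping each list in a Series so the gap becomes NaN). Whether padding with NaN is appropriate depends on the data, so treat this as an illustration rather than a recommendation.

```python
import pandas as pd

a_dictionary2 = {
    "Name": ["John", "James", "Jane"],
    "Age": [25, 28],
    "Origin": ["Pune", "Jhoomritallaiya", "Lonavala"],
}

# Check the lengths before building the DataFrame
lengths = {key: len(values) for key, values in a_dictionary2.items()}
print(lengths)  # {'Name': 3, 'Age': 2, 'Origin': 3}

# One possible workaround: Series are aligned on their index, so the
# missing third "Age" value is filled with NaN
padded = pd.DataFrame({key: pd.Series(values) for key, values in a_dictionary2.items()})
print(padded.shape)                # (3, 3)
print(padded["Age"].isna().sum())  # 1
```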

## From a List of Dictionaries

In [41]:
list_of_dictionaries = [
    {"Name": "John", "Age": 25, "Location": "X"},
    {"Name": "James", "Age": 78, "Location": "X-23a"},
    {"Name": "Jane", "Age": 21, "Location": "Y-2"}
]

In [42]:
df4 = pd.DataFrame(list_of_dictionaries)

In [43]:
df4

|   | Name | Age | Location |
|---|---|---|---|
| 0 | John | 25 | X |
| 1 | James | 78 | X-23a |
| 2 | Jane | 21 | Y-2 |
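
Unlike the dictionary-of-lists form, a list of dictionaries tolerates records with missing keys. The sketch below uses a hypothetical variation of the data above in which one record has no "Location"; the missing cell comes out as NaN.

```python
import pandas as pd

# Hypothetical variation of the records above: the second one lacks "Location"
records = [
    {"Name": "John", "Age": 25, "Location": "X"},
    {"Name": "James", "Age": 78},
    {"Name": "Jane", "Age": 21, "Location": "Y-2"},
]

df_records = pd.DataFrame(records)
print(list(df_records.columns))                # ['Name', 'Age', 'Location']
print(pd.isna(df_records.loc[1, "Location"]))  # True
```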


# Loading a DataFrame from a File

In [44]:
df = pd.read_csv("housing.csv")

In [45]:
df

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND |
| 20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND |
| 20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | INLAND |
| 20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND |


In [46]:
df.shape

(20640, 10)
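
For readers without housing.csv at hand, read_csv also accepts a file-like object, so the same steps can be tried on an in-memory sample. The sketch below uses a hypothetical three-row excerpt with only three of the columns; it is not the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the real housing.csv is not reproduced here
csv_text = """longitude,latitude,ocean_proximity
-122.23,37.88,NEAR BAY
-122.22,37.86,NEAR BAY
-121.09,39.48,INLAND
"""

sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)          # (3, 3)
print(list(sample.columns))  # ['longitude', 'latitude', 'ocean_proximity']
print(sample.index)          # RangeIndex(start=0, stop=3, step=1)
```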

In [47]:
df.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')

In [48]:
df.index

RangeIndex(start=0, stop=20640, step=1)

In [49]:
df.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [50]:
df.head(2)

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |


In [51]:
df.tail()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND |
| 20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND |
| 20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7 | 92300.0 | INLAND |
| 20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND |
| 20639 | -121.24 | 39.37 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 | 89400.0 | INLAND |


In [52]:
df.tail(2)

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND |
| 20639 | -121.24 | 39.37 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 | 89400.0 | INLAND |
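
As the outputs above suggest, head() and tail() return 5 rows by default and accept a row count. A minimal sketch on a hypothetical ten-row frame makes the behaviour easy to verify:

```python
import pandas as pd

# Hypothetical ten-row frame with a single column "n"
numbers = pd.DataFrame({"n": range(10)})

print(numbers.head().shape)            # (5, 1), the default is 5 rows
print(numbers.head(2)["n"].tolist())   # [0, 1]
print(numbers.tail(2)["n"].tolist())   # [8, 9]

# tail() keeps the original index labels
print(numbers.tail(2).index.tolist())  # [8, 9]
```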


In [53]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
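
In the info() output above, total_bedrooms has 20433 non-null values out of 20640 entries, so 207 values are missing. The sketch below shows how such counts could be obtained directly, using a hypothetical four-row stand-in for the housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the housing data
small = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

missing_per_column = small.isna().sum()
print(missing_per_column["total_rooms"])     # 0
print(missing_per_column["total_bedrooms"])  # 2

# Same arithmetic as reading info(): rows minus non-null count
print(len(small) - small["total_bedrooms"].count())  # 2
```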


In [54]:
df.describe()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |
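
Note that the describe() output above has nine columns: the text column ocean_proximity is left out, because describe() summarises only numeric columns by default when a frame mixes types. A small sketch on a hypothetical frame, where the statistics are easy to check by hand:

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column
tiny = pd.DataFrame({"x": [1, 2, 3, 4], "label": ["a", "b", "c", "d"]})

summary = tiny.describe()
print(list(summary.columns))      # ['x'], the text column is skipped
print(summary.loc["count", "x"])  # 4.0
print(summary.loc["mean", "x"])   # 2.5
print(summary.loc["50%", "x"])    # 2.5
print(summary.loc["max", "x"])    # 4.0
```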
