# Pandas Concepts #2/4

## DataFrame Basics

Please see the Pandas documentation for more information:

* [https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/huangjia2019/house/master/house.csv")

In [3]:
df.info  # without parentheses this shows the bound method's repr; call df.info() for the column summary

<bound method DataFrame.info of        longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0        -114.31     34.19                15.0       5612.0          1283.0   
1        -114.47     34.40                19.0       7650.0          1901.0   
2        -114.56     33.69                17.0        720.0           174.0   
3        -114.57     33.64                14.0       1501.0           337.0   
4        -114.57     33.57                20.0       1454.0           326.0   
...          ...       ...                 ...          ...             ...   
16995    -124.26     40.58                52.0       2217.0           394.0   
16996    -124.27     40.69                36.0       2349.0           528.0   
16997    -124.30     41.84                17.0       2677.0           531.0   
16998    -124.30     41.80                19.0       2672.0           552.0   
16999    -124.35     40.54                52.0       1820.0           300.0   

       population  
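Because `df.info` was evaluated without parentheses, the cell shows the method object rather than the usual summary. A minimal sketch of calling it instead, using a small made-up frame (the name `toy` and its values are illustrative assumptions, not the housing data):

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for the housing data.
toy = pd.DataFrame(
    {"population": [1015.0, 1129.0], "households": [472.0, 463.0]}
)

# info() prints a summary and returns None; buf= redirects the text
# into a buffer so it can be inspected or logged.
buffer = io.StringIO()
toy.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```

The summary lists each column with its non-null count and dtype, which is usually what you want when first inspecting a file.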

### What are the columns, and how many are there?

In [4]:
df.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value'],
      dtype='object')

In [5]:
len(df.columns)

9

### How many rows?

In [6]:
len(df)

17000

### How many rows and columns (in one command)?

In [7]:
df.shape

(17000, 9)
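`shape` is a plain tuple of (rows, columns), so it can be unpacked directly. A small sketch on a made-up frame (`toy` is an illustrative assumption):

```python
import pandas as pd

# Hypothetical toy frame: 3 rows, 2 columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Unpack the (rows, columns) tuple into two names.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```

The two values agree with `len(toy)` and `len(toy.columns)` respectively.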

# Accessing Data

## Column data by name

In [8]:
df['median_income']

0        1.4936
1        1.8200
2        1.6509
3        3.1917
4        1.9250
          ...  
16995    2.3571
16996    2.5179
16997    3.0313
16998    1.9797
16999    3.0147
Name: median_income, Length: 17000, dtype: float64

In [9]:
df.loc[:,'median_income'] # returns a Series

0        1.4936
1        1.8200
2        1.6509
3        3.1917
4        1.9250
          ...  
16995    2.3571
16996    2.5179
16997    3.0313
16998    1.9797
16999    3.0147
Name: median_income, Length: 17000, dtype: float64

In [10]:
df.loc[:,['median_income']] # returns a DataFrame

Unnamed: 0,median_income
0,1.4936
1,1.8200
2,1.6509
3,3.1917
4,1.9250
...,...
16995,2.3571
16996,2.5179
16997,3.0313
16998,1.9797


In [11]:
df[['median_income', 'median_house_value']]

Unnamed: 0,median_income,median_house_value
0,1.4936,66900.0
1,1.8200,80100.0
2,1.6509,85700.0
3,3.1917,73400.0
4,1.9250,65500.0
...,...,...
16995,2.3571,111400.0
16996,2.5179,79000.0
16997,3.0313,103600.0
16998,1.9797,85800.0


In [12]:
df.loc[:,['median_income', 'median_house_value']]

Unnamed: 0,median_income,median_house_value
0,1.4936,66900.0
1,1.8200,80100.0
2,1.6509,85700.0
3,3.1917,73400.0
4,1.9250,65500.0
...,...,...
16995,2.3571,111400.0
16996,2.5179,79000.0
16997,3.0313,103600.0
16998,1.9797,85800.0
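The difference between selecting with a single name and with a list of names is easy to check by looking at the result type. A minimal sketch, assuming a two-row stand-in frame (`toy` and its values are illustrative):

```python
import pandas as pd

# Hypothetical two-row stand-in for the housing data.
toy = pd.DataFrame(
    {
        "median_income": [1.4936, 1.82],
        "median_house_value": [66900.0, 80100.0],
    }
)

as_series = toy["median_income"]    # single name -> one-dimensional Series
as_frame = toy[["median_income"]]   # list of names -> two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The same rule applies to `.loc`: a scalar column label gives a Series, a list of labels gives a DataFrame.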


## Row data by label (index label)

In [13]:
df.loc[1]

longitude              -114.47
latitude                 34.40
housing_median_age       19.00
total_rooms            7650.00
total_bedrooms         1901.00
population             1129.00
households              463.00
median_income             1.82
median_house_value    80100.00
Name: 1, dtype: float64

In [14]:
df.loc[110]

longitude              -115.7300
latitude                 33.3500
housing_median_age       23.0000
total_rooms            1586.0000
total_bedrooms          448.0000
population              338.0000
households              182.0000
median_income             1.2132
median_house_value    30000.0000
Name: 110, dtype: float64

In [15]:
df.loc[[1,110],:]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
110,-115.73,33.35,23.0,1586.0,448.0,338.0,182.0,1.2132,30000.0


## Row and column data by integer position (`iloc`)

In [16]:
df.iloc[0]

longitude              -114.3100
latitude                 34.1900
housing_median_age       15.0000
total_rooms            5612.0000
total_bedrooms         1283.0000
population             1015.0000
households              472.0000
median_income             1.4936
median_house_value    66900.0000
Name: 0, dtype: float64

In [17]:
df.iloc[75]

longitude              -115.5500
latitude                 32.7800
housing_median_age        5.0000
total_rooms            2652.0000
total_bedrooms          606.0000
population             1767.0000
households              536.0000
median_income             2.8025
median_house_value    84300.0000
Name: 75, dtype: float64

In [18]:
df.iloc[:,4] # position 4 (zero-based, so the fifth column) is "total_bedrooms"

0        1283.0
1        1901.0
2         174.0
3         337.0
4         326.0
          ...  
16995     394.0
16996     528.0
16997     531.0
16998     552.0
16999     300.0
Name: total_bedrooms, Length: 17000, dtype: float64

In [19]:
df.iloc[:,[4,6]]

Unnamed: 0,total_bedrooms,households
0,1283.0,472.0
1,1901.0,463.0
2,174.0,117.0
3,337.0,226.0
4,326.0,262.0
...,...,...
16995,394.0,369.0
16996,528.0,465.0
16997,531.0,456.0
16998,552.0,478.0


In [20]:
df.iloc[1:100,[4,6]]

Unnamed: 0,total_bedrooms,households
1,1901.0,463.0
2,174.0,117.0
3,337.0,226.0
4,326.0,262.0
5,236.0,239.0
...,...,...
95,143.0,143.0
96,203.0,201.0
97,507.0,451.0
98,414.0,421.0


In [21]:
df.iloc[[2,4,5,20],[4,6]]

Unnamed: 0,total_bedrooms,households
2,174.0,117.0
4,326.0,262.0
5,236.0,239.0
20,360.0,303.0
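In the housing data the index labels happen to equal the row positions, which hides the difference between `loc` and `iloc`. A sketch with a made-up frame whose labels are deliberately not 0, 1, 2, ... (the frame and its labels are illustrative assumptions):

```python
import pandas as pd

# Hypothetical frame with index labels that differ from row positions.
toy = pd.DataFrame(
    {"households": [472.0, 463.0, 117.0, 226.0]},
    index=[10, 20, 30, 40],
)

by_label = toy.loc[20, "households"]   # row labelled 20
by_position = toy.iloc[1, 0]           # second row, first column

# Label slices include the end label; position slices exclude the end.
label_slice = toy.loc[10:30]
position_slice = toy.iloc[0:2]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

Both lookups return the same cell here, but the slices differ in length because `loc` slicing is inclusive of the stop label.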


## Using slices

In [22]:
df[:10] # first 10 rows

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0
5,-114.58,33.63,29.0,1387.0,236.0,671.0,239.0,3.3438,74000.0
6,-114.58,33.61,25.0,2907.0,680.0,1841.0,633.0,2.6768,82400.0
7,-114.59,34.83,41.0,812.0,168.0,375.0,158.0,1.7083,48500.0
8,-114.59,33.61,34.0,4789.0,1175.0,3134.0,1056.0,2.1782,58400.0
9,-114.6,34.83,46.0,1497.0,309.0,787.0,271.0,2.1908,48100.0


In [23]:
df[:10].iloc[:,:10] # first 10 rows and up to 10 columns (only 9 exist here)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0
5,-114.58,33.63,29.0,1387.0,236.0,671.0,239.0,3.3438,74000.0
6,-114.58,33.61,25.0,2907.0,680.0,1841.0,633.0,2.6768,82400.0
7,-114.59,34.83,41.0,812.0,168.0,375.0,158.0,1.7083,48500.0
8,-114.59,33.61,34.0,4789.0,1175.0,3134.0,1056.0,2.1782,58400.0
9,-114.6,34.83,46.0,1497.0,309.0,787.0,271.0,2.1908,48100.0


In [24]:
df.iloc[:10,:10]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0
5,-114.58,33.63,29.0,1387.0,236.0,671.0,239.0,3.3438,74000.0
6,-114.58,33.61,25.0,2907.0,680.0,1841.0,633.0,2.6768,82400.0
7,-114.59,34.83,41.0,812.0,168.0,375.0,158.0,1.7083,48500.0
8,-114.59,33.61,34.0,4789.0,1175.0,3134.0,1056.0,2.1782,58400.0
9,-114.6,34.83,46.0,1497.0,309.0,787.0,271.0,2.1908,48100.0
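The `:10` column slice works even though the frame has only nine columns, because positional slices are clipped to the available range. A single out-of-range position, by contrast, raises an error. A minimal sketch on a made-up frame (`toy` is an illustrative assumption):

```python
import pandas as pd

# Hypothetical frame: 2 rows, 3 columns.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Slices past the end are clipped rather than rejected.
clipped = toy.iloc[:10, :10]
print(clipped.shape)

# A single integer position past the end is an error.
raised = False
try:
    toy.iloc[:, 10]
except IndexError:
    raised = True
print(raised)
```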


## ... and for more complex slices ...

In [25]:
pd.concat(
    [df.iloc[2:30,[4,6]], 
     df.iloc[500:600,[4,6]]]
)

Unnamed: 0,total_bedrooms,households
2,174.0,117.0
3,337.0,226.0
4,326.0,262.0
5,236.0,239.0
6,680.0,633.0
...,...,...
595,592.0,631.0
596,516.0,486.0
597,675.0,702.0
598,184.0,182.0
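As an alternative to concatenating two slices, the row positions can be built first and passed to a single `iloc` call. The sketch below uses NumPy's `np.r_` helper for that; this is a swapped-in technique, not the approach used above, and the `toy` frame is an illustrative assumption:

```python
import numpy as np
import pandas as pd

# Hypothetical 20-row frame with the same two column names.
toy = pd.DataFrame(
    {"total_bedrooms": np.arange(20.0), "households": np.arange(20.0) * 2}
)

# np.r_ joins several ranges into one array of integer positions.
positions = np.r_[2:5, 10:13]
picked = toy.iloc[positions, [0, 1]]

# The same rows via concat, for comparison.
stacked = pd.concat([toy.iloc[2:5, [0, 1]], toy.iloc[10:13, [0, 1]]])

print(list(picked.index))
```

Both approaches select the same rows; `np.r_` tends to read better when there are many ranges.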


## Adding Columns

In [26]:
df.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value'],
      dtype='object')

In [27]:
df['br_pop_ratio'] = (df['total_bedrooms'] / df['population']).round(2)

In [28]:
df.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'br_pop_ratio'],
      dtype='object')

In [29]:
df['br_pop_ratio'].min()

0.0

In [30]:
df['br_pop_ratio'].max()

14.19

What's going on? All mansions?

In [31]:
df.query('br_pop_ratio == 14.19')

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,br_pop_ratio
10317,-120.08,38.8,34.0,1988.0,511.0,36.0,15.0,4.625,162500.0,14.19


At least we know where it is!
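Matching on a literal float such as 14.19 works here because the column was rounded, but it can be fragile in general. One way to avoid it is to ask for the label of the maximum directly with `idxmax()`. A sketch on a three-row stand-in frame (the `toy` frame is an illustrative assumption; its middle row reuses the bedroom and population figures from the row found above):

```python
import pandas as pd

# Hypothetical three-row stand-in for the housing data.
toy = pd.DataFrame(
    {
        "total_bedrooms": [1283.0, 511.0, 174.0],
        "population": [1015.0, 36.0, 333.0],
    }
)
toy["br_pop_ratio"] = (toy["total_bedrooms"] / toy["population"]).round(2)

# idxmax() returns the index label of the largest value.
top_label = toy["br_pop_ratio"].idxmax()

# A list of labels keeps the result as a one-row DataFrame.
top_row = toy.loc[[top_label]]
print(top_row)
```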

## Adding Rows

Use `concat()`:

* [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat)

In [32]:
df.iloc[[0]]*100

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,br_pop_ratio
0,-11431.0,3419.0,1500.0,561200.0,128300.0,101500.0,47200.0,149.36,6690000.0,126.0


In [33]:
pd.concat([df, df.iloc[[0]]*100])

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,br_pop_ratio
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,1.26
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,1.68
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,0.52
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,0.65
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,0.52
...,...,...,...,...,...,...,...,...,...,...
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,0.44
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,0.43
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,0.43
16999,-124.35,40.54,52.0,1820.0,300.0,806.0,270.0,3.0147,94600.0,0.37
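Note that `concat()` keeps the original index labels, so the appended row still carries label 0 and the result has a duplicate label. Passing `ignore_index=True` renumbers the rows instead. A minimal sketch on a made-up frame (`toy` and `new_row` are illustrative names):

```python
import pandas as pd

# Hypothetical two-row stand-in for the housing data.
toy = pd.DataFrame(
    {"population": [1015.0, 1129.0], "households": [472.0, 463.0]}
)
new_row = toy.iloc[[0]] * 100

# Default behaviour: labels are preserved, so 0 appears twice.
kept = pd.concat([toy, new_row])

# ignore_index=True discards the old labels and numbers rows 0..n-1.
renumbered = pd.concat([toy, new_row], ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```

Neither call modifies `toy`; `concat()` returns a new DataFrame, so assign the result if you want to keep it.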


Go to the next tutorial:

* [./pandas_pt1_reading_data.ipynb](./pandas_pt1_reading_data.ipynb)

Or jump to the others:

* [./pandas_pt2_dataframe_operations.ipynb](./pandas_pt2_dataframe_operations.ipynb)
* [./pandas_pt3_more_filtering.ipynb](./pandas_pt3_more_filtering.ipynb)
* [./pandas_pt4_basic_stats.ipynb](./pandas_pt4_basic_stats.ipynb)