# Python session - 3.2

## Pandas DataFrames

https://swcarpentry.github.io/python-novice-gapminder/08-data-frames/

#### Questions
- How can I do statistical analysis of tabular data?

#### Objectives
- Select individual values from a Pandas dataframe.
- Select entire rows or entire columns from a dataframe.
- Select a subset of both rows and columns from a dataframe in a single operation.
- Select a subset of a dataframe by a single Boolean criterion.

#### First note about Pandas DataFrames/Series

A `DataFrame` is a collection of `Series`: the `DataFrame` is how Pandas represents a table, and a `Series` is the data structure Pandas uses to represent a column.

Pandas is built on top of the NumPy library, which in practice means that most of the methods defined for NumPy arrays also apply to Pandas `Series` and `DataFrame` objects.

What makes Pandas so attractive is its powerful interface for accessing individual records of a table, its proper handling of missing values, and its support for relational-database operations between `DataFrame` objects.
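As a minimal sketch of the `DataFrame`/`Series` relationship described above, the following uses a small hand-made table. The frame `toy`, its column names, and its values are hypothetical and are not taken from the Gapminder files.

```python
import pandas as pd

# Toy data: hypothetical values, not taken from the Gapminder files
toy = pd.DataFrame(
    {"gdp_1952": [400.0, 550.0, 3200.0], "gdp_2007": [4950.0, 2450.0, 31650.0]},
    index=["China", "India", "Japan"],
)

column = toy["gdp_2007"]        # selecting one column returns a Series
print(type(toy).__name__)       # DataFrame
print(type(column).__name__)    # Series

# NumPy-style reductions are available on both objects
print(column.mean())            # mean of a single column
print(toy.max())                # column-wise maximum, returned as a Series
```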

#### Use `DataFrame.iloc[..., ...]` to select values by numerical index

You can specify a location by numerical index, analogous to a 2D version of character selection in strings.

In [23]:
import pandas
import pandas as pd

#Hints:
#index_col='country'
#data.iloc

# read the file and use the 'country' column as the row index
df = pd.read_csv("data/gapminder_gdp_asia.csv", index_col = "country")

# iloc selects by integer position: here row 0, column 1
df.iloc[0, 1]

# a list or range of row positions can be passed too, just as for columns
df.iloc[range(1,6)]

# inside the square brackets, the left side of ',' selects rows and the right side selects columns

# get value from the first cell
df.iloc[0, 0]

# get values from multiple cells
df.iloc[range(1,6), range(1,6)]

# get all the rows but a few columns
df.iloc[:, [3, 5, 8, 9]]

# get all the columns but a few rows
df.iloc[[3, 5, 8, 9], :]

# in the last two commands, ':' is a slice
# with numbers it defines where to slice; on its own it selects everything

#save the subset (sliced_data) in a variable
sliced_data = df.iloc[[3, 5, 8, 9], :]

#get statistics of the sliced data
sliced_data.describe()


Unnamed: 0,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
count,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
mean,2646.995638,3395.6765,4429.657384,5389.896686,6981.852093,9571.985896,9327.811925,9752.205579,9105.300015,10112.936864,11184.180116,14378.883438
std,1602.819029,2371.110966,3209.295533,3518.770347,4414.876267,6218.1626,6658.906301,8191.188436,10772.85156,12576.345406,13136.645424,17403.87711
min,368.469286,434.038336,496.913648,523.432314,421.624026,524.972183,624.475478,683.895573,682.303175,734.28517,896.226015,1713.778686
25%,2368.611823,2576.202816,3264.725763,4560.906932,6342.352115,8520.848983,5862.369821,5153.134922,2979.806309,2490.751139,3517.094488,3781.741101
50%,3044.873605,3459.66705,4439.989037,6052.347309,8945.98287,11537.368165,11063.120856,9143.227026,5490.646937,5669.915048,6815.739643,8038.388198
75%,3323.257421,4279.140733,5604.920658,6881.337063,9585.482849,12588.505077,14528.56296,13742.297683,11616.140643,13292.100773,14482.825271,18635.530535
max,4129.766056,6229.333562,8341.737815,8931.459811,9613.818607,14688.23507,14560.53051,20038.47269,24757.60301,28377.63219,30209.01516,39724.97867


#### Use `DataFrame.loc[..., ...]` to select values by names.

You can specify a location by row and column name, analogous to a 2D version of dictionary keys.

In [40]:
# Using asia data here

# If our file didn't have a column header
# data = pandas.read_csv("data/gapminder_gdp_asia.csv", header=None)

data = pandas.read_csv("data/gapminder_gdp_asia.csv", index_col='country')

data.loc["India", "gdpPercap_1952"]

# loc is like iloc, but with iloc you give integer positions for both rows and columns,
# whereas with loc you have to use the row and column labels (index names)

# again, the left side of ',' inside the square brackets selects rows and the right side selects columns

546.5657493

#### Use `:` on its own to mean all columns or all rows.

Just like Python’s usual slicing notation.

In [41]:
# Slice by "China", :

data.loc["China", :]

gdpPercap_1952     400.448611
gdpPercap_1957     575.987001
gdpPercap_1962     487.674018
gdpPercap_1967     612.705693
gdpPercap_1972     676.900092
gdpPercap_1977     741.237470
gdpPercap_1982     962.421380
gdpPercap_1987    1378.904018
gdpPercap_1992    1655.784158
gdpPercap_1997    2289.234136
gdpPercap_2002    3119.280896
gdpPercap_2007    4959.114854
Name: China, dtype: float64

In [47]:
# Would get the same result printing data.loc["China"] (without a second index).

data.loc["China"]

gdpPercap_1952     400.448611
gdpPercap_1957     575.987001
gdpPercap_1962     487.674018
gdpPercap_1967     612.705693
gdpPercap_1972     676.900092
gdpPercap_1977     741.237470
gdpPercap_1982     962.421380
gdpPercap_1987    1378.904018
gdpPercap_1992    1655.784158
gdpPercap_1997    2289.234136
gdpPercap_2002    3119.280896
gdpPercap_2007    4959.114854
Name: China, dtype: float64

In [45]:
# Would get a column data["gdpPercap_1952"]

data["gdpPercap_1952"]

# Also get the same result printing data.gdpPercap_1952 (since it’s a column name)
data.gdpPercap_1952

country
Afghanistan              779.445314
Bahrain                 9867.084765
Bangladesh               684.244172
Cambodia                 368.469286
China                    400.448611
Hong Kong China         3054.421209
India                    546.565749
Indonesia                749.681655
Iran                    3035.326002
Iraq                    4129.766056
Israel                  4086.522128
Japan                   3216.956347
Jordan                  1546.907807
Korea Dem. Rep.         1088.277758
Korea Rep.              1030.592226
Kuwait                108382.352900
Lebanon                 4834.804067
Malaysia                1831.132894
Mongolia                 786.566857
Myanmar                  331.000000
Nepal                    545.865723
Oman                    1828.230307
Pakistan                 684.597144
Philippines             1272.880995
Saudi Arabia            6459.554823
Singapore               2315.138227
Sri Lanka               1083.532030
Syria               

#### Select multiple columns or rows using `DataFrame.loc` and a named slice

In [51]:
# slice rows India to Israel; all columns are kept because no column selector is given
subset1 = data.loc["India":"Israel"]

# take all the rows and get data from 1972 to 1982
subset2 = data.loc[:, "gdpPercap_1972":"gdpPercap_1982"]

# slice rows, India to Israel, and columns, 1972 to 1982
subset3 = data.loc["India":"Israel", "gdpPercap_1972":"gdpPercap_1982"]

# in all of these cases, the subsets are saved in different variables

country,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982
India,724.032527,813.337323,855.723538
Indonesia,1111.107907,1382.702056,1516.872988
Iran,9613.818607,11888.59508,7608.334602
Iraq,9576.037596,14688.23507,14517.90711
Israel,12786.93223,13306.61921,15367.0292


In the above code, we discover that slicing using loc is inclusive at both ends, which differs from slicing using iloc, where slicing indicates everything up to but not including the final index.
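The difference between the two slicing rules can be sketched on a small hand-made frame. The frame `toy` and its labels are hypothetical, chosen only to keep the shapes easy to check.

```python
import pandas as pd

# Toy frame: hypothetical labels and values
toy = pd.DataFrame(
    {"y1952": [1, 2, 3, 4], "y1957": [5, 6, 7, 8], "y1962": [9, 10, 11, 12]},
    index=["a", "b", "c", "d"],
)

# Positional slice: stops before position 2 on both axes
by_position = toy.iloc[0:2, 0:2]

# Label slice: includes the end labels "c" and "y1962"
by_label = toy.loc["a":"c", "y1952":"y1962"]

print(by_position.shape)  # (2, 2)
print(by_label.shape)     # (3, 3)
```

A label slice has to include its end label because there is no general way to name "the label just after this one", whereas an integer position always has a successor.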

#### Result of slicing can be used in further operations

- Usually don’t just print a slice.
- All the statistical operators that work on entire dataframes work the same way on slices.
    - E.g., calculate max of a slice.

In [59]:
subset4 = subset3.max()
print(subset4)
subset4.min()

gdpPercap_1972    12786.93223
gdpPercap_1977    14688.23507
gdpPercap_1982    15367.02920
dtype: float64


12786.93223

In [65]:
# you can do multiple operations on your dataframe or its subset

print(subset3.T.iloc[-1])
subset3.T.iloc[-1].to_csv("test_subset.csv")

country
India          855.723538
Indonesia     1516.872988
Iran          7608.334602
Iraq         14517.907110
Israel       15367.029200
Name: gdpPercap_1982, dtype: float64



#### Use comparisons to select data based on value

- Comparison is applied element by element.
- Returns a similarly-shaped dataframe of True and False.

In [74]:
# Use a subset of data to keep output readable.
subset3 #saved in my previous computation

# Which values were greater than 10000 ?
subset3>10000

country,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982
India,False,False,False
Indonesia,False,False,False
Iran,False,True,False
Iraq,False,True,True
Israel,True,True,True


#### Select values or NaN using a Boolean mask.

- A frame full of Booleans is sometimes called a mask because of how it can be used.

In [75]:
subset3[subset3>10000]

country,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982
India,,,
Indonesia,,,
Iran,,11888.59508,
Iraq,,14688.23507,14517.90711
Israel,12786.93223,13306.61921,15367.0292


- Get the value where the mask is true, and NaN (Not a Number) where it is false.
- Useful because NaNs are ignored by operations like max, min, average, etc.
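To illustrate how NaN values are skipped by summary operations, here is a small sketch on a hand-made frame. The frame `toy` and its rounded values are hypothetical, not the real Gapminder numbers.

```python
import pandas as pd

# Toy frame: hypothetical, rounded values
toy = pd.DataFrame(
    {"y1972": [700.0, 9600.0, 12800.0], "y1977": [800.0, 11900.0, 13300.0]},
    index=["India", "Iran", "Israel"],
)

mask = toy > 10000     # same shape as toy, filled with True/False
masked = toy[mask]     # original value where True, NaN where False

print(masked)
print(masked.mean())   # NaN entries are skipped rather than counted as 0
print(masked.count())  # number of non-NaN values in each column
```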

In [82]:
subset3_stat = subset3[subset3>10000].describe()
subset3_stat

df.head()

country,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007
Afghanistan,779.445314,820.85303,853.10071,836.197138,739.981106,786.11336,978.011439,852.395945,649.341395,635.341351,726.734055,974.580338
Bahrain,9867.084765,11635.79945,12753.27514,14804.6727,18268.65839,19340.10196,19211.14731,18524.02406,19035.57917,20292.01679,23403.55927,29796.04834
Bangladesh,684.244172,661.637458,686.341554,721.186086,630.233627,659.877232,676.981866,751.979403,837.810164,972.770035,1136.39043,1391.253792
Cambodia,368.469286,434.038336,496.913648,523.432314,421.624026,524.972183,624.475478,683.895573,682.303175,734.28517,896.226015,1713.778686
China,400.448611,575.987001,487.674018,612.705693,676.900092,741.23747,962.42138,1378.904018,1655.784158,2289.234136,3119.280896,4959.114854


#### Exercise - Selection of Individual Values

Assume Pandas has been imported into your notebook and the Gapminder GDP data for Europe has been loaded:

In [83]:
import pandas

df = pandas.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

Write an expression to find the Per Capita GDP of Serbia in 2007.

In [85]:
df.loc["Serbia", "gdpPercap_2007"]

9786.534714

#### Exercise - Extent of Slicing

- Do the two statements below produce the same output?
- Based on this, what rule governs what is included (or not) in numerical slices and named slices in Pandas?

```Python
print(data.iloc[0:2, 0:2])
print(data.loc['Albania':'Belgium', 'gdpPercap_1952':'gdpPercap_1962'])
```

In [90]:
#print(df.head())
print(df.iloc[0:2, 0:2])
print(df.loc['Albania':'Belgium', 'gdpPercap_1952':'gdpPercap_1962'])

         gdpPercap_1952  gdpPercap_1957
country                                
Albania     1601.056136     1942.284244
Austria     6137.076492     8842.598030
         gdpPercap_1952  gdpPercap_1957  gdpPercap_1962
country                                                
Albania     1601.056136     1942.284244     2312.888958
Austria     6137.076492     8842.598030    10750.721110
Belgium     8343.105127     9714.960623    10991.206760


#### Exercise - Reconstructing Data

- Explain what each line in the following short program does: what is in first, second, etc.?

In [117]:
first = pandas.read_csv('data/gapminder_all.csv', index_col='country')
second = first[first['continent'] == 'Americas']

#print(first.describe())

#print(second.loc['Puerto Rico'])

third = second.drop('Puerto Rico')
#print(third.loc['Puerto Rico'])

fourth = third.drop('continent', axis = "columns") # axis = "columns" drops a column instead of a row
fourth.to_csv('result.csv')
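The same sequence of steps can be traced on a tiny hand-made table, which makes it easier to see what each intermediate variable holds. The frame `toy` and its values are hypothetical stand-ins for `gapminder_all.csv`, and the final `to_csv` step is left out so that no file is written.

```python
import pandas as pd

# Toy stand-in for the full dataset: hypothetical values
toy = pd.DataFrame(
    {"continent": ["Americas", "Europe", "Americas"], "gdp": [10.0, 20.0, 30.0]},
    index=pd.Index(["Canada", "France", "Puerto Rico"], name="country"),
)

second = toy[toy["continent"] == "Americas"]      # keep rows where the test is True
third = second.drop("Puerto Rico")                # drop one row by its index label
fourth = third.drop("continent", axis="columns")  # drop one column by its name

print(second)
print(third)
print(fourth)
```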

#### Exercise - Selecting Indices

Explain in simple terms what `idxmin` and `idxmax` do in the short program below. When would you use these methods?

In [142]:
data = pandas.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
#print(data.T.min().idxmin())
print(data.idxmin())
print()
print(data.idxmax())

gdpPercap_1952    Bosnia and Herzegovina
gdpPercap_1957    Bosnia and Herzegovina
gdpPercap_1962    Bosnia and Herzegovina
gdpPercap_1967    Bosnia and Herzegovina
gdpPercap_1972    Bosnia and Herzegovina
gdpPercap_1977    Bosnia and Herzegovina
gdpPercap_1982                   Albania
gdpPercap_1987                   Albania
gdpPercap_1992                   Albania
gdpPercap_1997                   Albania
gdpPercap_2002                   Albania
gdpPercap_2007                   Albania
dtype: object

gdpPercap_1952    Switzerland
gdpPercap_1957    Switzerland
gdpPercap_1962    Switzerland
gdpPercap_1967    Switzerland
gdpPercap_1972    Switzerland
gdpPercap_1977    Switzerland
gdpPercap_1982    Switzerland
gdpPercap_1987         Norway
gdpPercap_1992         Norway
gdpPercap_1997         Norway
gdpPercap_2002         Norway
gdpPercap_2007         Norway
dtype: object
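As a rough illustration of what `idxmin` and `idxmax` return, the sketch below applies them to a small hand-made frame. The frame `toy` and its rounded values are hypothetical.

```python
import pandas as pd

# Toy frame: hypothetical, rounded values
toy = pd.DataFrame(
    {"y1952": [1600.0, 6100.0, 8300.0], "y2007": [5900.0, 36100.0, 33700.0]},
    index=["Albania", "Austria", "Belgium"],
)

lowest = toy.idxmin()    # for each column, the row label holding the smallest value
highest = toy.idxmax()   # for each column, the row label holding the largest value
best_year = toy.idxmax(axis="columns")  # for each row, the column label of its largest value

print(lowest)
print(highest)
print(best_year)
```

These methods are useful when the question is "which country (or year)?" rather than "what value?", since `min` and `max` return only the values.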


#### Practice with Selection

Assume Pandas has been imported and the Gapminder GDP data for Europe has been loaded. Write an expression to select each of the following:

- GDP per capita for all countries in 1982.
- GDP per capita for Denmark for all years.
- GDP per capita for all countries for years after 1985.
- GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952.

In [148]:
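# Note: this loads the all-continents file, so the results below cover every country, not only Europe
df = pandas.read_csv('data/gapminder_all.csv', index_col='country')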

df.head()

country,continent,gdpPercap_1952,gdpPercap_1957,gdpPercap_1962,gdpPercap_1967,gdpPercap_1972,gdpPercap_1977,gdpPercap_1982,gdpPercap_1987,gdpPercap_1992,...,pop_1962,pop_1967,pop_1972,pop_1977,pop_1982,pop_1987,pop_1992,pop_1997,pop_2002,pop_2007
Algeria,Africa,2449.008185,3013.976023,2550.81688,3246.991771,4182.663766,4910.416756,5745.160213,5681.358539,5023.216647,...,11000948.0,12760499.0,14760787.0,17152804.0,20033753.0,23254956.0,26298373.0,29072015.0,31287142,33333216
Angola,Africa,3520.610273,3827.940465,4269.276742,5522.776375,5473.288005,3008.647355,2756.953672,2430.208311,2627.845685,...,4826015.0,5247469.0,5894858.0,6162675.0,7016384.0,7874230.0,8735988.0,9875024.0,10866106,12420476
Benin,Africa,1062.7522,959.60108,949.499064,1035.831411,1085.796879,1029.161251,1277.897616,1225.85601,1191.207681,...,2151895.0,2427334.0,2761407.0,3168267.0,3641603.0,4243788.0,4981671.0,6066080.0,7026113,8078314
Botswana,Africa,851.241141,918.232535,983.653976,1214.709294,2263.611114,3214.857818,4551.14215,6205.88385,7954.111645,...,512764.0,553541.0,619351.0,781472.0,970347.0,1151184.0,1342614.0,1536536.0,1630347,1639131
Burkina Faso,Africa,543.255241,617.183465,722.512021,794.82656,854.735976,743.387037,807.198586,912.063142,931.752773,...,4919632.0,5127935.0,5433886.0,5889574.0,6634596.0,7586551.0,8878303.0,10352843.0,12251209,14326203


In [149]:
df.gdpPercap_1982

country
Algeria            5745.160213
Angola             2756.953672
Benin              1277.897616
Botswana           4551.142150
Burkina Faso        807.198586
                      ...     
Switzerland       28397.715120
Turkey             4241.356344
United Kingdom    18232.424520
Australia         19477.009280
New Zealand       17632.410400
Name: gdpPercap_1982, Length: 142, dtype: float64

In [150]:
df.loc['Denmark']

continent              Europe
gdpPercap_1952        9692.39
gdpPercap_1957        11099.7
gdpPercap_1962        13583.3
gdpPercap_1967        15937.2
gdpPercap_1972        18866.2
gdpPercap_1977        20422.9
gdpPercap_1982          21688
gdpPercap_1987        25116.2
gdpPercap_1992        26406.7
gdpPercap_1997        29804.3
gdpPercap_2002        32166.5
gdpPercap_2007        35278.4
lifeExp_1952            70.78
lifeExp_1957            71.81
lifeExp_1962            72.35
lifeExp_1967            72.96
lifeExp_1972            73.47
lifeExp_1977            74.69
lifeExp_1982            74.63
lifeExp_1987             74.8
lifeExp_1992            75.33
lifeExp_1997            76.11
lifeExp_2002            77.18
lifeExp_2007           78.332
pop_1952            4.334e+06
pop_1957          4.48783e+06
pop_1962           4.6469e+06
pop_1967           4.8388e+06
pop_1972           4.9916e+06
pop_1977          5.08842e+06
pop_1982          5.11781e+06
pop_1987          5.12702e+06
pop_1992  

In [151]:
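# 'gdpPercap_1985' is not a column name, but because the column names are sorted
# the label slice starts at the next one (gdpPercap_1987).
# The slice is open-ended, so with gapminder_all.csv it also returns the lifeExp_* and pop_* columns;
# end it at 'gdpPercap_2007' to keep only GDP per capita.
df.loc[:, 'gdpPercap_1985':]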

country,gdpPercap_1987,gdpPercap_1992,gdpPercap_1997,gdpPercap_2002,gdpPercap_2007,lifeExp_1952,lifeExp_1957,lifeExp_1962,lifeExp_1967,lifeExp_1972,...,pop_1962,pop_1967,pop_1972,pop_1977,pop_1982,pop_1987,pop_1992,pop_1997,pop_2002,pop_2007
Algeria,5681.358539,5023.216647,4797.295051,5288.040382,6223.367465,43.077,45.685,48.303,51.407,54.518,...,11000948.0,12760499.0,14760787.0,17152804.0,20033753.0,23254956.0,26298373.0,29072015.0,31287142,33333216
Angola,2430.208311,2627.845685,2277.140884,2773.287312,4797.231267,30.015,31.999,34.000,35.985,37.928,...,4826015.0,5247469.0,5894858.0,6162675.0,7016384.0,7874230.0,8735988.0,9875024.0,10866106,12420476
Benin,1225.856010,1191.207681,1232.975292,1372.877931,1441.284873,38.223,40.358,42.618,44.885,47.014,...,2151895.0,2427334.0,2761407.0,3168267.0,3641603.0,4243788.0,4981671.0,6066080.0,7026113,8078314
Botswana,6205.883850,7954.111645,8647.142313,11003.605080,12569.851770,47.622,49.618,51.520,53.298,56.024,...,512764.0,553541.0,619351.0,781472.0,970347.0,1151184.0,1342614.0,1536536.0,1630347,1639131
Burkina Faso,912.063142,931.752773,946.294962,1037.645221,1217.032994,31.975,34.906,37.814,40.697,43.591,...,4919632.0,5127935.0,5433886.0,5889574.0,6634596.0,7586551.0,8878303.0,10352843.0,12251209,14326203
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Switzerland,30281.704590,31871.530300,32135.323010,34480.957710,37506.419070,69.620,70.560,71.320,72.770,73.780,...,5666000.0,6063000.0,6401400.0,6316424.0,6468126.0,6649942.0,6995447.0,7193761.0,7361757,7554661
Turkey,5089.043686,5678.348271,6601.429915,6508.085718,8458.276384,43.585,48.079,52.098,54.336,57.005,...,29788695.0,33411317.0,37492953.0,42404033.0,47328791.0,52881328.0,58179144.0,63047647.0,67308928,71158647
United Kingdom,21664.787670,22705.092540,26074.531360,29478.999190,33203.261280,69.180,70.420,70.760,71.360,72.010,...,53292000.0,54959000.0,56079000.0,56179000.0,56339704.0,56981620.0,57866349.0,58808266.0,59912431,60776238
Australia,21888.889030,23424.766830,26997.936570,30687.754730,34435.367440,69.120,70.330,70.930,71.100,71.930,...,10794968.0,11872264.0,13177000.0,14074100.0,15184200.0,16257249.0,17481977.0,18565243.0,19546792,20434176
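The output above also contains life-expectancy and population columns, because an open-ended label slice runs to the last column. One possible way to keep only the GDP columns after a given year is sketched below on a hand-made frame. The frame `toy`, its values, and the helper list `gdp_after_1985` are hypothetical.

```python
import pandas as pd

# Toy frame that mimics the column naming of gapminder_all.csv (hypothetical values)
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 60.0, 70.0]],
    index=["Denmark"],
    columns=["gdpPercap_1982", "gdpPercap_1987", "gdpPercap_1992",
             "lifeExp_1982", "lifeExp_1987"],
)

# Open-ended slice: everything from gdpPercap_1987 to the last column
open_ended = toy.loc[:, "gdpPercap_1987":]

# Closed slice: stops at the last GDP column (inclusive)
gdp_only = toy.loc[:, "gdpPercap_1987":"gdpPercap_1992"]

# Alternative: build the column list from the names themselves
gdp_after_1985 = [
    name for name in toy.columns
    if name.startswith("gdpPercap_") and int(name.split("_")[1]) > 1985
]
selected = toy[gdp_after_1985]

print(list(open_ended.columns))
print(list(gdp_only.columns))
print(list(selected.columns))
```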


In [156]:
df_2007_multi_of_1952 = df['gdpPercap_2007']/df['gdpPercap_1952']
df_2007_multi_of_1952

country
Algeria            2.541179
Angola             1.362614
Benin              1.356182
Botswana          14.766499
Burkina Faso       2.240260
                    ...    
Switzerland        2.545529
Turkey             4.295502
United Kingdom     3.327144
Australia          3.429956
New Zealand        2.385718
Length: 142, dtype: float64

### Key Points

- Use `DataFrame.iloc[..., ...]` to select values by integer location.
- Use `:` on its own to mean all columns or all rows.
- Select multiple columns or rows using `DataFrame.loc` and a named slice.
- Result of slicing can be used in further operations.
- Use comparisons to select data based on value.
- Select values or NaN using a Boolean mask.

In [None]:
#A demonstration of how you can create and modify a dataframe

import pandas

In [None]:
#Create an empty df with one column 'X'
df = pandas.DataFrame(columns = ["X"]) 

In [None]:
#Create an empty df with two columns 'X' and 'Y'
df1 = pandas.DataFrame(columns = ["X", "Y"]) 

In [None]:
#Create an empty df with two columns 'X' and 'Y' and 1 row 'a'
df2 = pandas.DataFrame(columns = ["X", "Y"], index=['a'])

In [None]:
#Create an empty df with two columns 'X' and 'Y' and multiple rows 'a', 'b', 'c'
df3 = pandas.DataFrame(columns = ["X", "Y"], index=['a', 'b', 'c'])

In [None]:
#Create a df with two columns 'X' and 'Y' and multiple rows 'a', 'b', 'c', and initialize every cell with 0
df4 = pandas.DataFrame(0, columns = ["X", "Y"], index=['a', 'b', 'c'])

#set values for the row 'a'
df4.loc['a'] = 1 # this sets both cells in the row 'a' to 1
df4.loc['a'] = [1, 12] # this sets the cell X to 1 and the cell Y to 12 in the row 'a'

In [None]:
#set values for the column 'X'
df4['X'] = 1 # this sets all the cells in the column 'X' to 1
df4['X'] = [1, 20, 24] # this sets a different value in each cell of the column 'X'

In [None]:
#insert a value in a specific cell
df4.loc['c', 'Y'] = 34

In [None]:
# you can use iloc similarly
df4.iloc[1, 1] = 29

In [None]:
# initialize a new column with values
df4['Z'] = [22, 38, 44]

In [None]:
# add a new column filled with 0s
df4['ZZ'] = 0

In [None]:
#sort rows by the index (row labels)
df4.sort_index()

#sort in reverse order
df4.sort_index(ascending=False)

#sort by a defined column
df4.sort_values(by='Y')

#sort by multiple columns
df4.sort_values(by=['Y', 'Z']) 

# REMINDER: many pandas methods accept a list in [] brackets where a single value is allowed

In [None]:
#create a mask for values more than 1 in the column X
df4[df4['X']>1]

#masking by multiple values
df4[(df4['X']>1) & (df4['Y']<30)] # each condition needs its own parentheses because & binds tighter than > and <

#check null values
df4.isnull()

#replacing values
df5 = df4.replace(0, '66')
df5['ZZ'] = ['col1', 'col2', 'col3']
df5
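When several conditions are combined into one mask, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>` and `<`. A minimal sketch follows; the frame `toy` is a hypothetical stand-in that reuses the `X` and `Y` values assigned earlier.

```python
import pandas as pd

# Toy frame: same X and Y values as assigned to df4 above, rebuilt here so the block runs on its own
toy = pd.DataFrame({"X": [1, 20, 24], "Y": [12, 29, 34]}, index=["a", "b", "c"])

# Rows that satisfy both conditions
both = toy[(toy["X"] > 1) & (toy["Y"] < 30)]

# Rows that satisfy at least one condition
either = toy[(toy["X"] > 20) | (toy["Y"] < 20)]

print(both)
print(either)
```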

In [None]:
df6 = df5.set_index('ZZ')

In [None]:
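#groupby

#group by the column 'X' and get the mean of the values in 'Y'
#the result gets its own name so that df6 (indexed by 'ZZ') stays available for the cells below
mean_y_by_x = df5.groupby('X')['Y'].mean()
mean_y_by_x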
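Grouping is easier to see when the grouping column contains repeated values, which is not the case in the frame above. A small sketch with a hand-made frame follows; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Toy frame with a repeated key in 'X' (hypothetical values)
toy = pd.DataFrame({"X": ["p", "p", "q"], "Y": [10, 20, 60]})

# One mean of 'Y' per distinct value of 'X'; the result is a Series indexed by 'X'
means = toy.groupby("X")["Y"].mean()
print(means)
```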

In [None]:
# create a new dataframe ndf

ndf = pandas.DataFrame([['gene1', 299], ['gene2', 599], ['gene3', 678]], index=['col1', 'col2', 'col3'], columns=['Gene', 'Length'])
ndf

In [None]:
# concatenate df6 and ndf

df6_ndf_1 = pandas.concat([df6, ndf]) 
df6_ndf_1
# by default, dataframes are concatenated along axis 0 (rows)

In [None]:
# merge df6 and ndf using the column axis (axis=1)

df6_ndf_2 = pandas.concat([df6, ndf], axis=1)
df6_ndf_2


In [None]:
# Optionally use join

df6_ndf_3 = ndf.join(df6)
df6_ndf_3
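The three ways of combining frames shown above can be compared on two small hand-made frames that share the same index. The frames `left` and `right` and the column `Score` are hypothetical.

```python
import pandas as pd

# Two toy frames with the same row labels (hypothetical values)
left = pd.DataFrame({"Gene": ["gene1", "gene2"], "Length": [299, 599]},
                    index=["col1", "col2"])
right = pd.DataFrame({"Score": [0.5, 0.9]}, index=["col1", "col2"])

stacked = pd.concat([left, right])               # axis=0: rows appended, missing cells become NaN
side_by_side = pd.concat([left, right], axis=1)  # axis=1: columns placed next to each other, aligned on the index
joined = left.join(right)                        # left join on the index

print(stacked.shape)
print(side_by_side.shape)
print(joined)
```

When both frames have identical row labels, as here, `concat` along the column axis and `join` give the same table; they differ once the indexes only partly overlap.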

In [None]:
# try out the commands from your cheatsheets here

