# pandas - Python Data Analysis Library
`pandas` is a software library written for data manipulation and analysis. It contains the `DataFrame` object for manipulating numerical tables and time series data. The dataframe in pandas combines aspects of MATLAB indexing with functionality similar to the statistical programming language `R`.
First, since `pandas` is an auxiliary library, you must always import it before use.

In [6]:
import pandas as pd

`pandas` provides an easy-to-use function, `read_table`, for reading delimited text files into a dataframe.

In [33]:
# read table
dftemps = pd.read_table('data/GlobalTempbyMonth.txt', header=None, index_col=0, sep=r'\s+')
print(type(dftemps))

<class 'pandas.core.frame.DataFrame'>


In [34]:
# show data
dftemps

0,1,2,3,4,5,6,7,8,9,10,11
1850/01,-0.700,-0.757,-0.627,-0.889,-0.510,-1.047,-0.352,-0.898,-0.500,-1.100,-0.299
1850/02,-0.286,-0.342,-0.218,-0.474,-0.098,-0.625,0.052,-0.482,-0.082,-0.676,0.111
1850/03,-0.732,-0.799,-0.680,-0.934,-0.529,-1.009,-0.454,-0.942,-0.519,-1.080,-0.382
1850/04,-0.563,-0.620,-0.493,-0.767,-0.360,-0.816,-0.310,-0.777,-0.349,-0.894,-0.231
1850/05,-0.327,-0.386,-0.279,-0.521,-0.132,-0.589,-0.064,-0.533,-0.128,-0.661,0.001
...,...,...,...,...,...,...,...,...,...,...,...
2017/11,0.552,0.509,0.590,0.513,0.591,0.441,0.663,0.494,0.610,0.427,0.678
2017/12,0.600,0.542,0.640,0.560,0.640,0.475,0.725,0.533,0.658,0.458,0.738
2018/01,0.554,0.507,0.603,0.527,0.581,0.410,0.699,0.503,0.611,0.402,0.710
2018/02,0.522,0.469,0.569,0.495,0.550,0.404,0.640,0.468,0.579,0.393,0.654
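
To experiment with `read_table` without the data file, the sketch below reads a few whitespace-separated rows from an in-memory string. The name `sample_text` and the use of `io.StringIO` are assumptions made for illustration; the values are the first three rows and columns shown above.

```python
import io
import pandas as pd

# Hypothetical in-memory sample in the same layout as the temperature file:
# a date label followed by whitespace-separated values, with no header row.
sample_text = """\
1850/01 -0.700 -0.757 -0.627
1850/02 -0.286 -0.342 -0.218
1850/03 -0.732 -0.799 -0.680
"""

# header=None: there are no column names, so pandas numbers the columns.
# index_col=0: use the first column (the dates) as the row index.
# sep=r'\s+': split fields on any run of whitespace.
sample = pd.read_table(io.StringIO(sample_text), header=None, index_col=0, sep=r'\s+')

print(sample.shape)
print(list(sample.columns))
print(list(sample.index))
```

Because the first column becomes the index, the remaining columns are numbered starting from 1, which matches the column headers in the output above.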


Notice that the dataframe has row and column labels, which can be either strings or numbers. The row index of the dataframe `dftemps` is the first column of the file: the dates. The index of a dataframe is also bolded when printed. Every dataframe has an index that labels its rows. Index labels are usually unique, so that each label identifies exactly one row, but `pandas` does not require uniqueness.

In [37]:
# show index
dftemps.index

Index(['1850/01', '1850/02', '1850/03', '1850/04', '1850/05', '1850/06',
       '1850/07', '1850/08', '1850/09', '1850/10',
       ...
       '2017/06', '2017/07', '2017/08', '2017/09', '2017/10', '2017/11',
       '2017/12', '2018/01', '2018/02', '2018/03'],
      dtype='object', name=0, length=2019)
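
As a minimal sketch of how an index can be inspected and changed, the example below builds a small table, promotes one column to the index, and checks whether the labels are unique. The dataframe `toy` and its column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy table; the column names are illustrative only.
toy = pd.DataFrame(
    {"date": ["1850/01", "1850/02", "1850/03"], "temp": [-0.700, -0.286, -0.732]}
)

# By default the index is a range of integers: 0, 1, 2.
print(list(toy.index))

# set_index returns a new dataframe that uses the 'date' column as the index.
by_date = toy.set_index("date")
print(list(by_date.index))

# is_unique reports whether every index label occurs only once.
print(by_date.index.is_unique)
```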

`pandas` also has a function for reading Excel files, `read_excel`.

In [10]:
dfcarbon = pd.read_excel('data/GlobalCarbonBudget2022.xlsx', sheet_name='Global Carbon Budget', header=0, skiprows=20)

In [11]:
dfcarbon

,Year,fossil emissions excluding carbonation,land-use change emissions,atmospheric growth,ocean sink,land sink,cement carbonation sink,budget imbalance
0,1959,2.417091,1.938933,2.03904,0.975005,0.401805,0.012684,0.927490
1,1960,2.562137,1.792600,1.50804,0.959013,1.234131,0.013835,0.639717
2,1961,2.570540,1.666500,1.65672,0.805321,0.839819,0.014723,0.920457
3,1962,2.661315,1.608267,1.18944,0.895229,1.322231,0.015872,0.846810
4,1963,2.803399,1.542733,1.21068,1.059837,0.871917,0.016867,1.186831
...,...,...,...,...,...,...,...,...
58,2017,9.851730,1.182300,4.54536,2.854828,3.555004,0.202927,-0.124088
59,2018,10.050902,1.141200,5.03388,2.947231,3.647435,0.209702,-0.646145
60,2019,10.120786,1.243800,5.43744,2.994756,3.041949,0.214002,-0.323561
61,2020,9.624478,1.107467,4.99140,2.998115,3.105137,0.220563,-0.583270


In [12]:
dfcarbon.columns

Index(['Year', 'fossil emissions excluding carbonation',
       'land-use change emissions', 'atmospheric growth', 'ocean sink',
       'land sink', 'cement carbonation sink', 'budget imbalance'],
      dtype='object')

In [13]:
# map each column name to a shorter name made of its first two words
name_dict = {}
for name in dfcarbon.columns:
    namelist = name.split(' ')
    name_dict[name] = ' '.join(namelist[0:2])
print(name_dict)

{'Year': 'Year', 'fossil emissions excluding carbonation': 'fossil emissions', 'land-use change emissions': 'land-use change', 'atmospheric growth': 'atmospheric growth', 'ocean sink': 'ocean sink', 'land sink': 'land sink', 'cement carbonation sink': 'cement carbonation', 'budget imbalance': 'budget imbalance'}


In [14]:
newdfcarbon = dfcarbon.rename(columns = name_dict)

In [15]:
newdfcarbon

,Year,fossil emissions,land-use change,atmospheric growth,ocean sink,land sink,cement carbonation,budget imbalance
0,1959,2.417091,1.938933,2.03904,0.975005,0.401805,0.012684,0.927490
1,1960,2.562137,1.792600,1.50804,0.959013,1.234131,0.013835,0.639717
2,1961,2.570540,1.666500,1.65672,0.805321,0.839819,0.014723,0.920457
3,1962,2.661315,1.608267,1.18944,0.895229,1.322231,0.015872,0.846810
4,1963,2.803399,1.542733,1.21068,1.059837,0.871917,0.016867,1.186831
...,...,...,...,...,...,...,...,...
58,2017,9.851730,1.182300,4.54536,2.854828,3.555004,0.202927,-0.124088
59,2018,10.050902,1.141200,5.03388,2.947231,3.647435,0.209702,-0.646145
60,2019,10.120786,1.243800,5.43744,2.994756,3.041949,0.214002,-0.323561
61,2020,9.624478,1.107467,4.99140,2.998115,3.105137,0.220563,-0.583270


## Indexing into a dataframe (aka slicing a dataframe)
The two indexers that are essential to accessing information in a dataframe are `loc` and `iloc`. `loc` takes row and column labels, which may be strings or numbers, exactly as they appear in the index and the column headers. `iloc` takes only integer positions, counting rows and columns from 0. Using `iloc` allows you to index a dataframe using only numbers.

In [None]:
dftemps.loc['2018/03']

In [None]:
dftemps.loc['2018/03',1]

In [None]:
dftemps.iloc[:,3]

In [None]:
dfcarbon['ocean sink'][3:5]

In [None]:
dfcarbon.loc[0,'ocean sink']

In [None]:
dfcarbon.iloc[:,2]

In [None]:
dfcarbon.Year
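
The difference between label-based and position-based indexing can be sketched on a small table. The dataframe `toy` below is hypothetical; it deliberately mixes an integer column label with a string one to show that `loc` works with labels of either kind.

```python
import pandas as pd

# Hypothetical toy table with string row labels and mixed column labels.
toy = pd.DataFrame(
    {1: [10.0, 20.0, 30.0], "b": [1.5, 2.5, 3.5]},
    index=["x", "y", "z"],
)

# loc selects by label: row 'y', column 1 (the column *named* 1).
print(toy.loc["y", 1])       # 20.0

# iloc selects by position: second row, first column.
print(toy.iloc[1, 0])        # 20.0

# Label slices include both endpoints; position slices exclude the stop.
print(list(toy.loc["x":"y"].index))   # ['x', 'y']
print(list(toy.iloc[0:1].index))      # ['x']
```

Note the slicing difference: `loc` includes the end label, while `iloc` follows the usual Python convention of excluding the stop position.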

## Adding to Dataframes
You can add columns to a dataframe with the following syntax. Notice that `pandas` has a number of methods associated with dataframes, such as `mean`, `min`, and `max`. These methods are meant to act directly on the dataframe. Other statistical tools are available directly from `pandas`; see [Dataframe statistical methods](https://studyopedia.com/pandas/statistical-functions/). The basic statistical tools you have are:

| Method      | Description                                      |
|-------------|--------------------------------------------------|
| sum()       | Return the sum of the values.                    |
| count()     | Return the count of non-missing values.          |
| max()       | Return the maximum of the values.                |
| min()       | Return the minimum of the values.                |
| mean()      | Return the mean of the values.                   |
| median()    | Return the median of the values.                 |
| std()       | Return the standard deviation of the values.     |
| describe()  | Return the summary statistics for each column.   |
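
As a small illustration of the methods in the table, the sketch below applies `mean` along each axis and calls `describe` on a hypothetical dataframe `toy`; the data are made up for the example.

```python
import pandas as pd

# Hypothetical toy data, one row per observation.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 6.0, 8.0]})

# With the default axis=0, each method returns one value per column.
col_means = toy.mean()
print(col_means["a"], col_means["b"])

# With axis=1, each method returns one value per row.
row_means = toy.mean(axis=1)
print(list(row_means))

# describe() collects count, mean, std, min, quartiles, and max per column.
summary = toy.describe()
print(summary.loc["max", "b"])
```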


In [38]:
# add new columns, computing each statistic from the 11 temperature columns only
temps = dftemps.iloc[:, :11]
dftemps['average'] = temps.mean(axis=1)
dftemps['min'] = temps.min(axis=1)
dftemps['max'] = temps.max(axis=1)

In [39]:
dftemps['average']

In [18]:
# sort by average
dftemps2 = dftemps.sort_values(by='average')

In [19]:
dftemps2

0,1,2,3,4,5,6,7,8,9,10,11,average,min,max
1893/01,-0.974,-1.057,-0.913,-1.082,-0.866,-1.286,-0.663,-1.111,-0.850,-1.316,-0.641,-0.978091,-1.316,-0.641
1864/01,-0.941,-1.014,-0.873,-1.134,-0.748,-1.472,-0.410,-1.147,-0.736,-1.511,-0.372,-0.941636,-1.511,-0.372
1861/01,-0.893,-0.960,-0.833,-1.066,-0.721,-1.336,-0.451,-1.081,-0.710,-1.375,-0.416,-0.894727,-1.375,-0.416
1862/12,-0.887,-0.978,-0.812,-1.071,-0.702,-1.319,-0.455,-1.093,-0.691,-1.368,-0.415,-0.890091,-1.368,-0.415
1917/03,-0.832,-0.910,-0.745,-0.926,-0.738,-1.024,-0.640,-0.952,-0.709,-1.058,-0.604,-0.830727,-1.058,-0.604
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2016/01,0.934,0.889,0.981,0.908,0.959,0.792,1.075,0.883,0.989,0.784,1.085,0.934455,0.784,1.085
2016/04,0.937,0.892,0.977,0.913,0.961,0.804,1.070,0.885,0.985,0.794,1.079,0.936091,0.794,1.079
2015/12,1.024,0.975,1.064,0.998,1.050,0.908,1.140,0.968,1.073,0.894,1.149,1.022091,0.894,1.149
2016/03,1.106,1.062,1.144,1.082,1.129,0.979,1.232,1.053,1.152,0.965,1.239,1.103909,0.965,1.239


In [20]:
# get a certain column
dftemps2.average

0
1893/01   -0.978091
1864/01   -0.941636
1861/01   -0.894727
1862/12   -0.890091
1917/03   -0.830727
             ...   
2016/01    0.934455
2016/04    0.936091
2015/12    1.022091
2016/03    1.103909
2016/02    1.110182
Name: average, Length: 2019, dtype: float64

In [21]:
# add a column that sums the two emission sources
newdfcarbon['total carbon'] = newdfcarbon['fossil emissions'] + newdfcarbon['land-use change']

In [22]:
newdfcarbon

,Year,fossil emissions,land-use change,atmospheric growth,ocean sink,land sink,cement carbonation,budget imbalance,total carbon
0,1959,2.417091,1.938933,2.03904,0.975005,0.401805,0.012684,0.927490,4.356024
1,1960,2.562137,1.792600,1.50804,0.959013,1.234131,0.013835,0.639717,4.354737
2,1961,2.570540,1.666500,1.65672,0.805321,0.839819,0.014723,0.920457,4.237040
3,1962,2.661315,1.608267,1.18944,0.895229,1.322231,0.015872,0.846810,4.269582
4,1963,2.803399,1.542733,1.21068,1.059837,0.871917,0.016867,1.186831,4.346133
...,...,...,...,...,...,...,...,...,...
58,2017,9.851730,1.182300,4.54536,2.854828,3.555004,0.202927,-0.124088,11.034030
59,2018,10.050902,1.141200,5.03388,2.947231,3.647435,0.209702,-0.646145,11.192102
60,2019,10.120786,1.243800,5.43744,2.994756,3.041949,0.214002,-0.323561,11.364586
61,2020,9.624478,1.107467,4.99140,2.998115,3.105137,0.220563,-0.583270,10.731944


## Reforming a Dataframe
Part of data wrangling is manipulating data into the most useful form. Suppose we wanted the average temperature for every year instead of every month. You can use `groupby` to group rows that share a value, such as the year, and then aggregate each group. *Can you describe what happens in each of the following lines?*

In [35]:
# group by year and calculate the average temperature
# (uses the 'average' column added in the previous section)
dftemps['year'] = dftemps.index.str[:4]
year_average = dftemps.groupby('year')['average'].mean()
year_average
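
The split-then-aggregate idea behind `groupby` can be sketched on a tiny table. The dataframe `toy` and its values are hypothetical; only the `'YYYY/MM'` label format is borrowed from the temperature data.

```python
import pandas as pd

# Hypothetical toy data: monthly values labelled 'YYYY/MM'.
toy = pd.DataFrame(
    {"average": [1.0, 3.0, 2.0, 4.0]},
    index=["2000/01", "2000/02", "2001/01", "2001/02"],
)

# Take the first four characters of each index label as the year.
toy["year"] = toy.index.str[:4]

# Split the rows by year, then average the 'average' column within each group.
year_average = toy.groupby("year")["average"].mean()
print(year_average.to_dict())
```

The result is a Series indexed by year, with one mean per group. Selecting the column with `['average']` requires that column to exist before grouping.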

## Reordering a Dataframe
You can reorder the columns by using `iloc` and listing the column positions in the new order. This returns a new dataframe; the original is unchanged unless you assign the result back to a variable.

In [None]:
newdfcarbon.iloc[:,[0,1,2,8,3,4,5,6,7]]

In [None]:
newdfcarbon
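
A minimal sketch of the same idea on a hypothetical dataframe `toy` shows that reordering produces a new dataframe, and that listing column names is an alternative to listing positions.

```python
import pandas as pd

# Hypothetical toy table with three columns.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# iloc returns a new dataframe; 'toy' itself keeps its original column order.
reordered = toy.iloc[:, [0, 2, 1]]
print(list(reordered.columns))   # ['a', 'c', 'b']
print(list(toy.columns))         # ['a', 'b', 'c']

# The same reordering, written with column names instead of positions.
by_name = toy[["a", "c", "b"]]
print(list(by_name.columns))     # ['a', 'c', 'b']
```

To keep the new order, assign the result to a variable, for example `toy = toy.iloc[:, [0, 2, 1]]`.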

## Writing to a File
After manipulating data in a dataframe, you can write the modified table to a CSV, Excel, or JSON file with the `to_csv`, `to_excel`, and `to_json` methods.
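
As a sketch of writing a dataframe to disk, the example below saves a small table as CSV and JSON in a temporary directory and reads the CSV back. The file names and the dataframe `toy` are assumptions made for illustration. `to_excel` is left out because it needs an Excel engine such as `openpyxl` to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy table using two rows from the carbon budget output above.
toy = pd.DataFrame({"Year": [1959, 1960], "ocean sink": [0.975005, 0.959013]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "toy.csv"      # hypothetical file name
    json_path = Path(tmp) / "toy.json"    # hypothetical file name

    # index=False leaves the integer row index out of the CSV file.
    toy.to_csv(csv_path, index=False)
    # orient='records' writes one JSON object per row.
    toy.to_json(json_path, orient="records")

    # Read the CSV back to confirm that the columns survived the round trip.
    round_trip = pd.read_csv(csv_path)

print(list(round_trip.columns))
print(round_trip.shape)
```

Without `index=False`, `to_csv` also writes the row index as an extra first column, which is useful when the index carries information, as the dates do in `dftemps`.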