## CM4044: AI In Chemistry
## Semester 1 2020/21

<hr>

## Tutorial 3a: Introduction to DataFrame in Pandas Part I
## Objectives
### $\bullet$ Pandas Library
### $\bullet$ Create DataFrame Object and Simple Data Descriptions
### $\bullet$ Data Selection and Filtering
### $\bullet$ Add Index, Row or Column
### $\bullet$ Reset Index and Delete Row or Column
### $\bullet$ Rename Index or Column
### $\bullet$ Deal with Missing Data
### $\bullet$ Iterate Over Rows and Columns
### $\bullet$ File Input and Output

<hr>



Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and **data analysis** tools for the Python programming language. The name Pandas is derived from "panel data", an econometrics term for multidimensional data sets.

The official website of Pandas is [here](https://pandas.pydata.org/). You can also find a lot of tutorial resources from [here](http://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html).


`pandas` has two main data structures:

- `Series`
    - a one-dimensional labeled array (“a one-dimensional ndarray”) that stores values together with an index label for each value.
- `DataFrame`
    - a two-dimensional data structure: basically a table with rows and columns. The columns have names and the rows have index labels.

Older versions of pandas also provided `Panel`, a size-mutable 3D labeled array, but it was removed in pandas 0.25 and is not available in the version used here.

Of these data structures, `DataFrame` is the most widely used.

In general, you can think that the Pandas `DataFrame` consists of three main components: the data, the index, and the columns.

Firstly, the DataFrame can contain data that is:

    1. a Pandas `DataFrame`
    2. a Pandas `Series`: a one-dimensional labeled array capable of holding any data type with axis labels or index. 
      An example of a Series object is one column from a DataFrame.
    3. a NumPy `ndarray`, which can be a record array or a structured array
    4. a two-dimensional NumPy `ndarray`
    5. a dictionary of one-dimensional `ndarray`s, lists, dictionaries or `Series`

Besides data, you can also specify the index and column names for your DataFrame. The index labels distinguish the rows, while the column names distinguish the columns. You will see later that these two components of the DataFrame come in handy when you’re manipulating your data.
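
The three components described above (data, index and columns) can be sketched with a small example. The solvent data below is illustrative and approximate, and the variable names are assumptions made for this sketch only.

```python
import pandas as pd

# Illustrative, approximate data: boiling point (deg C) and molar mass (g/mol)
df = pd.DataFrame(
    data=[[78.4, 46.07], [100.0, 18.02], [56.1, 58.08]],
    index=["ethanol", "water", "acetone"],        # row labels
    columns=["boiling_point_C", "molar_mass"],    # column labels
)

print(df.values)    # the data as a NumPy array
print(df.index)     # the row labels
print(df.columns)   # the column labels

# A single column is a Series that shares the DataFrame's index
bp = df["boiling_point_C"]
print(type(bp))
```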

Finally, given the structure of a `DataFrame` object, real-world data such as the following can be loaded into it naturally:

    --In a school system DataFrame, each row could represent a single student in the school,
      and columns may represent the student's name (string), age (number), date of birth (date), and address (string).
    --In an economics DataFrame, each row may represent a single city or geographical area,
      and columns might include the name of the area (string), the population (number), the average age of the population (number),
      the number of households (number), the number of schools in each area (number) etc.
    --In a shop or e-commerce system DataFrame, each row may be used to represent a customer,
      with columns for the number of items purchased (number), the date of original registration (date),
      and the credit card number (string).

## 1. Import Pandas

To use the Pandas library, you first have to import it. Here is a common way to do this:

In [1]:
import numpy as np
import pandas as pd

print(pd.__version__)

pd.set_option('display.max_columns', 50)   # display at most 50 columns when printing; choose the value to suit your data


1.0.1


## 2. How To Create a `DataFrame` Object?

A `DataFrame` object can be constructed by calling the constructor, `DataFrame(data, index, columns)`.

Many kinds of objects can be passed to the `data` argument.

`index` specifies the row labels and `columns` specifies the column labels. If `index` and/or `columns` are not given, default values are used.

The `columns` parameter lets us tell the constructor how we'd like the columns ordered. When the data is a dictionary, the columns follow the insertion order of the dictionary keys by default (older versions of pandas sorted them alphabetically), as the `football` example below shows. If the `DataFrame` is created by reading a data file, it follows the column order in the file.

Below are some examples:

In [2]:
# construct a DataFrame object from a numpy array
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df0 = pd.DataFrame(data,columns=['a', 'b', 'c'])
df0

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9


In [3]:
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}  # a dictionary
#football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])
football = pd.DataFrame(data)
football

Unnamed: 0,year,team,wins,losses
0,2010,Bears,11,5
1,2011,Bears,8,8
2,2012,Bears,10,6
3,2011,Packers,15,1
4,2012,Packers,11,5
5,2010,Lions,6,10
6,2011,Lions,10,6
7,2012,Lions,4,12
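
The `index` and `columns` arguments of the constructor can be sketched as follows. This is a minimal example on a shortened copy of the football data; the row labels `'a'`, `'b'`, `'c'` and the name `small` are made up for illustration.

```python
import pandas as pd

data = {'year': [2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears'],
        'wins': [11, 8, 10]}

# columns= selects and orders the dictionary keys;
# index= replaces the default 0, 1, 2, ... row labels
small = pd.DataFrame(data, columns=['team', 'wins', 'year'], index=['a', 'b', 'c'])
print(small)
```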


`DataFrame.shape` gives you the dimensions of your `DataFrame` as a tuple: the number of rows followed by the number of columns.

In [4]:
print(football.shape)  # 8 rows and 4 columns

(8, 4)


You can get the `DataFrame`’s row labels with `.index` and its column labels with `.columns`.



In [5]:
print(football.index)  

print(football.columns) 

RangeIndex(start=0, stop=8, step=1)
Index(['year', 'team', 'wins', 'losses'], dtype='object')


If you want more information on `DataFrame.index` or `DataFrame.columns`, you can check the `.values` of these attributes:

In [6]:
print(football.index.values)

[0 1 2 3 4 5 6 7]


In [7]:
print(football.columns.values)

['year' 'team' 'wins' 'losses']


Now you have the row and column labels as special kinds of sequences. You can get a single item:

In [8]:
print(football.index[1])   #1

print(football.columns[1])  # team

1
team


## 2.1 `head()`, `tail()` and `sample()`

When a `DataFrame` object holds a large amount of data, you can use **dot notation** to call `head(n)` to show the first `n` rows (5 rows by default):

In [9]:
football.head(3)

Unnamed: 0,year,team,wins,losses
0,2010,Bears,11,5
1,2011,Bears,8,8
2,2012,Bears,10,6


You can check the last `n` rows by calling `tail(n)` (5 is the default value):

In [10]:
football.tail(5)  # check last five rows

Unnamed: 0,year,team,wins,losses
3,2011,Packers,15,1
4,2012,Packers,11,5
5,2010,Lions,6,10
6,2011,Lions,10,6
7,2012,Lions,4,12


or randomly check `n` rows by calling `sample(n)` (1 is the default value):

In [11]:
football.sample(4)   # randomly check 4 rows

Unnamed: 0,year,team,wins,losses
5,2010,Lions,6,10
4,2012,Packers,11,5
6,2011,Lions,10,6
0,2010,Bears,11,5


## 2.2 `describe()`, `T` operator, `sort_index()` and `sort_values()`

`DataFrame.describe()` provides a quick statistical summary of your data including **count**, **mean**, **std**, **min**, **25%**, **50%**, **75%** and **max**. Note that, by default, `describe()` only summarizes the numeric columns.

In [12]:
football.describe()  # it automatically handles columns of numbers

Unnamed: 0,year,wins,losses
count,8.0,8.0,8.0
mean,2011.125,9.375,6.625
std,0.834523,3.377975,3.377975
min,2010.0,4.0,1.0
25%,2010.75,7.5,5.0
50%,2011.0,10.0,6.0
75%,2012.0,11.0,8.5
max,2012.0,15.0,12.0
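
Non-numeric columns can be summarized as well. The sketch below, on a shortened copy of the football data, calls `describe()` on a text column and then passes `include='all'` to cover every column; it is one possible way to do this, not the only one.

```python
import pandas as pd

football_small = pd.DataFrame({'team': ['Bears', 'Bears', 'Packers', 'Lions'],
                               'wins': [11, 8, 15, 6]})

# For a text column, describe() reports count, unique, top and freq
team_summary = football_small['team'].describe()
print(team_summary)

# include='all' summarizes every column; statistics that do not apply are NaN
print(football_small.describe(include='all'))
```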


You can use `T` operator to **transpose** rows and columns in a `DataFrame` object:

In [13]:
football.T

Unnamed: 0,0,1,2,3,4,5,6,7
year,2010,2011,2012,2011,2012,2010,2011,2012
team,Bears,Bears,Bears,Packers,Packers,Lions,Lions,Lions
wins,11,8,10,15,11,6,10,4
losses,5,8,6,1,5,10,6,12


Pandas `DataFrame.sort_index()` sorts a `DataFrame` by its **labels** along the given axis.
The sorting algorithm is applied to the axis labels rather than to the actual data, and the data is then rearranged accordingly. Through the `kind` argument we can choose the sorting algorithm, such as ‘quicksort’, ‘mergesort’ or ‘heapsort’.

In [14]:
football.sort_index(axis=1,ascending=False) # sort columns by label in reverse alphabetical order

Unnamed: 0,year,wins,team,losses
0,2010,11,Bears,5
1,2011,8,Bears,8
2,2012,10,Bears,6
3,2011,15,Packers,1
4,2012,11,Packers,5
5,2010,6,Lions,10
6,2011,10,Lions,6
7,2012,4,Lions,12


In [15]:
football.sort_index(axis=0, ascending=False)   # sort rows by index label in descending order

Unnamed: 0,year,team,wins,losses
7,2012,Lions,4,12
6,2011,Lions,10,6
5,2010,Lions,6,10
4,2012,Packers,11,5
3,2011,Packers,15,1
2,2012,Bears,10,6
1,2011,Bears,8,8
0,2010,Bears,11,5


You can also sort a `DataFrame` object by values in a specified column by calling `DataFrame.sort_values(column_label)`:

In [16]:
football.sort_values('wins',ascending=False)

Unnamed: 0,year,team,wins,losses
3,2011,Packers,15,1
0,2010,Bears,11,5
4,2012,Packers,11,5
2,2012,Bears,10,6
6,2011,Lions,10,6
1,2011,Bears,8,8
5,2010,Lions,6,10
7,2012,Lions,4,12


You can sort values by multiple columns:

In [17]:
# sort by two columns: first by wins in descending order;
# rows with the same number of wins are then sorted by year in descending order.
football.sort_values(['wins', 'year'],ascending=[False,False])

Unnamed: 0,year,team,wins,losses
3,2011,Packers,15,1
4,2012,Packers,11,5
0,2010,Bears,11,5
2,2012,Bears,10,6
6,2011,Lions,10,6
1,2011,Bears,8,8
5,2010,Lions,6,10
7,2012,Lions,4,12


## 3 How to Select Data from `DataFrame` Object?

There are several ways to select parts of data from a `DataFrame` object. 

    1. use the `[]` operator with column labels or a slice of rows
    2. use dot notation to access a column by its label
    3. use `at[]`, `iat[]`, `loc[]`, and `iloc[]` through dot notation

### 3.1 `[]` Operator

It is easy to use `[]` to select a column by its label; the returned object is a `Series`. Remember that a `DataFrame` is a collection of one or several `Series` objects.

In [18]:
print(football['wins'])   
print(type(football['wins']))

0    11
1     8
2    10
3    15
4    11
5     6
6    10
7     4
Name: wins, dtype: int64
<class 'pandas.core.series.Series'>


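The above operation is the same as `football.wins`. Dot notation only works when the column label is a valid Python identifier and does not clash with an existing `DataFrame` attribute or method:

In [19]:
print(football.wins) # dot notation; same result as football['wins']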

0    11
1     8
2    10
3    15
4    11
5     6
6    10
7     4
Name: wins, dtype: int64


**But if you apply the same approach to select a row by its label, it does not work**: `football[3]` looks for a column labelled `3` and raises a `KeyError`. With a slice, however, `[]` selects rows, and a slice of row labels includes the end point:

In [20]:
football.T['year':'team']  # transpose the table football to create row label and select a range.

Unnamed: 0,0,1,2,3,4,5,6,7
year,2010,2011,2012,2011,2012,2010,2011,2012
team,Bears,Bears,Bears,Packers,Packers,Lions,Lions,Lions


In [21]:
football.T['year':'year'] # a `DataFrame` object with one row of data

Unnamed: 0,0,1,2,3,4,5,6,7
year,2010,2011,2012,2011,2012,2010,2011,2012
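
The different behaviours of `[]` can be checked with a short sketch on a reduced copy of the football data. The names `rows_by_position` and `rows_by_label` are introduced here only for illustration.

```python
import pandas as pd

football_small = pd.DataFrame({'year': [2010, 2011, 2012],
                               'team': ['Bears', 'Bears', 'Packers'],
                               'wins': [11, 8, 15]})

# A single label inside [] is looked up among the columns, not the rows
try:
    football_small[1]
except KeyError as err:
    print('KeyError:', err)

# A slice inside [] selects rows instead
rows_by_position = football_small[0:2]             # integer slice: by position, end point excluded
rows_by_label = football_small.T['year':'team']    # label slice: end point included
print(rows_by_position)
print(rows_by_label)
```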


### 3.2 `loc[]`, `iloc[]`, `at[]` and `iat[]`

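These attributes are the indexers (also called accessors) that a Pandas `DataFrame` provides for data selection. They are used with square brackets rather than called like functions.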

A preferred operation to select data is to use `loc[]` or `iloc[]` of a `DataFrame` object. 

`.loc[]` accepts the labels of rows and columns and returns `Series` or `DataFrames`. You can use it to get entire rows or columns, as well as their parts.

`.iloc[]` accepts the zero-based indices of rows and columns and returns `Series` or `DataFrames`. You can use it to get entire rows or columns, or their parts.

`.at[]` accepts the labels of rows and columns and returns a single data value.

`.iat[]` accepts the zero-based indices of rows and columns and returns a single data value.

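When `.loc[]` or `.iloc[]` selects a single row or a single column, the result is a `Series`; selecting several rows and several columns gives a `DataFrame`.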

First, let us check some examples to use **label based indexing**, `loc[]`:

In [22]:
print(football.loc[3])    # use the default row index 

year         2011
team      Packers
wins           15
losses          1
Name: 3, dtype: object


In [23]:
fb1 = football.T  # transpose so that the rows get string labels ('year', 'team', ...)
print(fb1)
print(fb1.loc['year'])  # select the row labelled 'year'
print(type(fb1.loc['year']))  # the result is a Series object

            0      1      2        3        4      5      6      7
year     2010   2011   2012     2011     2012   2010   2011   2012
team    Bears  Bears  Bears  Packers  Packers  Lions  Lions  Lions
wins       11      8     10       15       11      6     10      4
losses      5      8      6        1        5     10      6     12
0    2010
1    2011
2    2012
3    2011
4    2012
5    2010
6    2011
7    2012
Name: year, dtype: object
<class 'pandas.core.series.Series'>


In [24]:
print(fb1.at['team',0])   # row label 'team' and column label 0

Bears


In [25]:
print(fb1.iat[0,1])     # row position 0 and column position 1

2011


A `DataFrame` object holds tabular data, so you can use both row and column labels to index the value(s):

In [26]:
print(football.loc[1:3,['year','team']])

   year     team
1  2011    Bears
2  2012    Bears
3  2011  Packers


The selection of data can also be performed by positional indexing through `iloc[]`. Below are the examples:

In [27]:
fb1.iloc[3]    # select the fourth row and return a `Series` object

0     5
1     8
2     6
3     1
4     5
5    10
6     6
7    12
Name: losses, dtype: object

Slicing by integer position is similar to NumPy/Python style; the end point is excluded:

In [28]:
fb1.iloc[2:5, 0:2]  # ending point excluded!

Unnamed: 0,0,1
wins,11,8
losses,5,8
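
The difference between label slices and position slices is easy to overlook, so here is a minimal side-by-side sketch. The small `DataFrame` and its labels are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

by_label = df.loc['a':'c']     # labels 'a', 'b' and 'c': end point included
by_position = df.iloc[0:2]     # positions 0 and 1: end point excluded

print(by_label)
print(by_position)

# .at[] and .iat[] each return one scalar value
print(df.at['c', 'x'], df.iat[2, 0])
```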


You can also select data with lists of integer position locations:

In [29]:
fb1.iloc[[0, 3, 2], [0, 2]]    # use position index in lists for row and column

Unnamed: 0,0,2
year,2010,2012
losses,5,6
wins,11,10


### 3.3 Data selection by conditions (data filtering)

Data filtering is another powerful feature of Pandas: it selects data based on certain conditions. It works similarly to indexing a NumPy array with a Boolean array.

In [30]:
wins_check = football['wins'] >= 8

print(type(wins_check))   # a Series object
print(wins_check)

<class 'pandas.core.series.Series'>
0     True
1     True
2     True
3     True
4     True
5    False
6     True
7    False
Name: wins, dtype: bool


You can use the returned Boolean `Series` object, `wins_check`, to retrieve a `DataFrame` with the rows that satisfy the condition:

In [31]:
football[wins_check]

Unnamed: 0,year,team,wins,losses
0,2010,Bears,11,5
1,2011,Bears,8,8
2,2012,Bears,10,6
3,2011,Packers,15,1
4,2012,Packers,11,5
6,2011,Lions,10,6


You can combine conditions with `&` (and), each wrapped in parentheses, to generate a new `DataFrame`:

In [32]:
football[(football['wins']>=8) & (football['year'] > 2010)]

Unnamed: 0,year,team,wins,losses
1,2011,Bears,8,8
2,2012,Bears,10,6
3,2011,Packers,15,1
4,2012,Packers,11,5
6,2011,Lions,10,6
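
Conditions can also be combined with `|` (or) and negated with `~`. The sketch below uses a shortened copy of the football data and the `Series.isin()` method; the names `either` and `not_bears` are illustrative.

```python
import pandas as pd

football_small = pd.DataFrame({'year': [2010, 2011, 2012, 2011],
                               'team': ['Bears', 'Bears', 'Lions', 'Packers'],
                               'wins': [11, 8, 4, 15]})

# | means "or"; each condition needs its own parentheses
either = football_small[(football_small['wins'] >= 11) | (football_small['year'] == 2012)]

# ~ negates a condition; isin() tests membership in a list of values
not_bears = football_small[~football_small['team'].isin(['Bears'])]

print(either)
print(not_bears)
```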


## 4 How To Add an Index, Row or Column to a Pandas `DataFrame`?

## 4.1 Adding an Index to a `DataFrame` object

When you create a `DataFrame` object, you have the option to add input to the `index` argument to make sure that you have the index that you desire. When you don’t specify this, your `DataFrame` object will have, by default, a numerically valued index that starts with 0 and continues until the last row of your DataFrame.

However, even when your index is specified for you automatically, you still have the power to re-use one of your columns and make it your index. You can easily do this by calling `set_index()` on your `DataFrame` object. 

In [33]:
df1 = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale':[55, 40, 84, 31]})

print(df1.index)   # the default index is numbers from 0 
print(df1)
print()

df2 = df1.set_index('month') # returns a new DataFrame, referred to by df2; df1 is unchanged
print(df2.index)
print(df2)
print()

df3 = df1.set_index(['year','month']) # returns a new DataFrame, referred to by df3, with a two-level index
print(df3.index)
print(df3)

RangeIndex(start=0, stop=4, step=1)
   month  year  sale
0      1  2012    55
1      4  2014    40
2      7  2013    84
3     10  2014    31

Int64Index([1, 4, 7, 10], dtype='int64', name='month')
       year  sale
month            
1      2012    55
4      2014    40
7      2013    84
10     2014    31

MultiIndex([(2012,  1),
            (2014,  4),
            (2013,  7),
            (2014, 10)],
           names=['year', 'month'])
            sale
year month      
2012 1        55
2014 4        40
2013 7        84
2014 10       31
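
Once a two-level index is set, `.loc[]` can select by one level or by a full `(year, month)` pair. The following is a minimal sketch on the same sales data; sorting the index first is a precaution taken in this example, not a requirement stated in the tutorial.

```python
import pandas as pd

df1 = pd.DataFrame({'month': [1, 4, 7, 10],
                    'year': [2012, 2014, 2013, 2014],
                    'sale': [55, 40, 84, 31]})

# Sort the two-level index so that lookups by the first level are well defined
df3 = df1.set_index(['year', 'month']).sort_index()

sales_2014 = df3.loc[2014]                # all rows for the year 2014
one_value = df3.loc[(2014, 10), 'sale']   # one value: year 2014, month 10
print(sales_2014)
print(one_value)

# reset_index() turns the index levels back into ordinary columns
flat = df3.reset_index()
print(flat)
```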


## 4.2 Adding a row to `DataFrame` object

You can add a row by calling `.loc[]` of a `DataFrame` object:

In [34]:
df4 = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), index= [2.5, 12.6, 4.8], columns=[48, 49, 50])
print(df4)
print()

df4.loc[2] = [11, 12, 13]   # the parameter 2 is used as the index of this row
print(df4)
print()

      48  49  50
2.5    1   2   3
12.6   4   5   6
4.8    7   8   9

      48  49  50
2.5    1   2   3
12.6   4   5   6
4.8    7   8   9
2.0   11  12  13
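
Note that `.loc[]` appends a row only when the label is new; with an existing label it overwrites that row. The sketch below shows both cases and also `pd.concat()`, which is another way to add rows; the labels and values are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]),
                  index=['a', 'b'], columns=['x', 'y', 'z'])

df.loc['c'] = [7, 8, 9]      # 'c' is a new label: a row is appended
df.loc['a'] = [0, 0, 0]      # 'a' already exists: the row is overwritten

# pd.concat() stacks DataFrames and returns a new object
extra = pd.DataFrame([[10, 11, 12]], index=['d'], columns=df.columns)
combined = pd.concat([df, extra])
print(combined)
```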



## 4.3 Adding a column to `DataFrame` object

A simple way to add a column is using `[]` operator with column label. For example:

In [35]:
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df5 = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df5)

print()

# Use [] operator
df5['D'] = df5.index  # numbers from 0 to total row number - 1


print(df5)

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9

   A  B  C  D
0  1  2  3  0
1  4  5  6  1
2  7  8  9  2


Or you can use `.loc[]` to add a column (`.iloc[]` cannot be used for this, because it only addresses positions that already exist):

In [36]:
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df6 = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df6)

print()

# Append a column to `df6`
# a column of a DataFrame is a Series, so it is natural to assign a Series object to a new column.
# note that the values here are strings, not numbers
df6.loc[:, 'E'] = pd.Series(['5', '6','7'])

# Print out `df6` again to see the changes
print(df6)

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9

   A  B  C  E
0  1  2  3  5
1  4  5  6  6
2  7  8  9  7
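
When a `Series` is assigned as a new column, pandas aligns it with the `DataFrame` by index label, not by position. The sketch below contrasts this with assigning a plain list and also shows `DataFrame.assign()`; the column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7]})           # row labels 0, 1, 2

df['from_list'] = [5, 6, 7]                    # a list is assigned by position
df['from_series'] = pd.Series([5, 6, 7], index=[1, 2, 3])  # a Series is aligned by label
print(df)   # row 0 has no matching label, so it gets NaN

# assign() returns a new DataFrame and leaves df unchanged
df_new = df.assign(double_A=df['A'] * 2)
print(df_new)
```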


## 5 How to Reset Indices, Delete Rows or Columns From a Pandas `DataFrame`?

If you want to remove the index from your DataFrame, you should reconsider because `DataFrame` and `Series` always have an index. But you can do these things with index:

    1. resetting the index of your DataFrame
    2. remove the index name, if there is any, by executing `df.index.name = None`
    3. remove duplicate index values by resetting the index, dropping the duplicates of the index column 
      that has been added to your DataFrame and reinstating that duplicateless column again as the index.

## 5.1 Reset indices

Firstly, when your index doesn't look entirely the way you want it to, you can opt to reset it. You can easily do this with `.reset_index()`. However, you should still watch out, as you can pass several arguments that can make or break the success of your reset:

In [37]:
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), index= [2.5, 12.6, 4.8], columns=[48, 49, 50])
print(df)
print()

df_reset = df.reset_index(level = 0, drop=True)
print(df_reset)

      48  49  50
2.5    1   2   3
12.6   4   5   6
4.8    7   8   9

   48  49  50
0   1   2   3
1   4   5   6
2   7   8   9
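
The `drop` argument decides whether the old index labels are kept. A minimal sketch of both choices, using a one-column `DataFrame` made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 4, 7]}, index=[2.5, 12.6, 4.8])

kept = df.reset_index()               # old labels are kept in a new column named 'index'
dropped = df.reset_index(drop=True)   # old labels are discarded

print(kept)
print(dropped)
```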


Another way to achieve the above is to assign to the `.index` attribute directly, which modifies the `DataFrame` in place:

In [38]:
df.index = range(len(df.index))  # equivalent: df.index = pd.RangeIndex(len(df.index))
print(df)

   48  49  50
0   1   2   3
1   4   5   6
2   7   8   9


We can also set the index from a column of data to replace the default one:

In [39]:
df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale':[55, 40, 84, 31]})
print(df)
print()

df.set_index('month',inplace=True) 
print(df)
print()

df.index.name = None   # set it to nothing for display
print(df)

   month  year  sale
0      1  2012    55
1      4  2014    40
2      7  2013    84
3     10  2014    31

       year  sale
month            
1      2012    55
4      2014    40
7      2013    84
10     2014    31

    year  sale
1   2012    55
4   2014    40
7   2013    84
10  2014    31


In [40]:
# there are duplicates of index: 2.5 and 2.5, 4.8 and 4.8
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [40, 50, 60], [23, 35, 37]]), 
    index= [2.5, 12.6, 4.8, 4.8, 2.5], 
    columns=[48, 49, 50])
print(df)

print()


# remove rows with duplicate index labels, keeping only the last occurrence of each label
df_reset=df.reset_index().drop_duplicates(subset='index', keep='last').set_index('index')
print(df_reset)

      48  49  50
2.5    1   2   3
12.6   4   5   6
4.8    7   8   9
4.8   40  50  60
2.5   23  35  37

       48  49  50
index            
12.6    4   5   6
4.8    40  50  60
2.5    23  35  37
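
A shorter alternative, sketched here as one possible approach, is to build a Boolean mask with `Index.duplicated()` and filter the rows with it. The column names `p` and `q` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2], [4, 5], [7, 8], [40, 50], [23, 35]]),
                  index=[2.5, 12.6, 4.8, 4.8, 2.5], columns=['p', 'q'])

# duplicated(keep='last') marks every repeated label except its last occurrence
mask = df.index.duplicated(keep='last')
deduplicated = df[~mask]
print(deduplicated)
```

Unlike the `reset_index()` route, this keeps the index unnamed.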


## 5.2 Delete rows

You can remove duplicate rows from your `DataFrame` object by executing `DataFrame.drop_duplicates()`.

In [41]:
data = {"Name": ["James", "Alice", "Phil", "James"],
        "Age": [24, 28, 40, 24],
        "Sex": ["Male", "Female", "Male", "Male"]}
df = pd.DataFrame(data)
print(df)
print()
df = df.drop_duplicates()   
print(df)

    Name  Age     Sex
0  James   24    Male
1  Alice   28  Female
2   Phil   40    Male
3  James   24    Male

    Name  Age     Sex
0  James   24    Male
1  Alice   28  Female
2   Phil   40    Male


If the rows you want to remove are not duplicates, you can use the `DataFrame.drop()` method and pass it the index labels of the rows to remove from your `DataFrame` object.

By default, `DataFrame.drop()` returns a new `DataFrame` with the specified rows removed. If you pass `inplace=True`, the original `DataFrame` will be modified and you’ll get `None` as the return value.

In [42]:
# there are duplicates of index: 2.5 and 2.5, 4.8 and 4.8
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [40, 50, 60], [23, 35, 37]]), 
    index= [2.5, 12.6, 4.8, 4.8, 2.5], 
    columns=[48, 49, 50])
print(df)

print()

df.drop(df.index[0], inplace=True) # df.index[0] is the label 2.5, so both rows labelled 2.5 are dropped; inplace=True modifies df directly

print(df)

      48  49  50
2.5    1   2   3
12.6   4   5   6
4.8    7   8   9
4.8   40  50  60
2.5   23  35  37

      48  49  50
12.6   4   5   6
4.8    7   8   9
4.8   40  50  60
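
Because `drop()` works on labels, it removes every row that carries the given label. To remove a row by position instead, one option (a sketch, not the only way) is to slice with `.iloc[]`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2], [4, 5], [23, 35]]),
                  index=[2.5, 12.6, 2.5], columns=['p', 'q'])

by_label = df.drop(2.5)         # removes both rows labelled 2.5
by_position = df.iloc[1:]       # removes only the first row, whatever its label

print(by_label)
print(by_position)
```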


## 5.3 Delete columns

To delete (a selection of) columns from your `DataFrame` object, you can use the `DataFrame.drop()` method with specification `axis=1`:

In [43]:
# there are duplicates of index: 2.5 and 2.5, 4.8 and 4.8
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [40, 50, 60], [23, 35, 37]]), 
    index= [2.5, 12.6, 4.8, 4.8, 2.5], 
    columns=[48, 49, 50])
print(df)

print()

df.drop(df.columns[0], axis=1, inplace=True) # the inplace = True means no need to reassign the DataFrame to new variable

print(df)

      48  49  50
2.5    1   2   3
12.6   4   5   6
4.8    7   8   9
4.8   40  50  60
2.5   23  35  37

      49  50
2.5    2   3
12.6   5   6
4.8    8   9
4.8   50  60
2.5   35  37


In [44]:
data = {"Name": ["James", "Alice", "Phil", "James"],
        "Age": [24, 28, 40, 24],
        "Sex": ["Male", "Female", "Male", "Male"]}
df = pd.DataFrame(data)
print(df)
print()
df2 = df.drop('Age',axis=1)  # sometimes you don't want to change the old data table
print(df)
print()
print(df2)

    Name  Age     Sex
0  James   24    Male
1  Alice   28  Female
2   Phil   40    Male
3  James   24    Male

    Name  Age     Sex
0  James   24    Male
1  Alice   28  Female
2   Phil   40    Male
3  James   24    Male

    Name     Sex
0  James    Male
1  Alice  Female
2   Phil    Male
3  James    Male


## 6 How to Rename the Index or Columns of a Pandas `DataFrame`?

To give the columns or the index values of your `DataFrame` different names, it’s best to use the `DataFrame.rename()` method.

In [45]:
# there are duplicates of index: 2.5 and 2.5, 4.8 and 4.8
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [40, 50, 60], [23, 35, 37]]), 
    index= [2.5, 12.6, 4.8, 4.8, 2.5], 
    columns=[48, 49, 50])
print(df)
print()

new_columns = {48:'col1', 49:'col2', 50:'col3'}
df.rename(columns=new_columns,inplace=True)

print(df)
print()

new_index = {2.5:'A', 12.6:'B', 4.8:"C"}
df.rename(index=new_index, inplace=True)

print(df)

      48  49  50
2.5    1   2   3
12.6   4   5   6
4.8    7   8   9
4.8   40  50  60
2.5   23  35  37

      col1  col2  col3
2.5      1     2     3
12.6     4     5     6
4.8      7     8     9
4.8     40    50    60
2.5     23    35    37

   col1  col2  col3
A     1     2     3
B     4     5     6
C     7     8     9
C    40    50    60
A    23    35    37


It is often convenient to rename index or column labels with a `lambda` function, a one-line anonymous function that is applied to each label.

In [46]:
# there are duplicates of index: 2.5 and 2.5, 4.8 and 4.8
df5 = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [40, 50, 60], [23, 35, 37]]), 
    index= [2.5, 12.6, 4.8, 4.8, 2.5], 
    columns=['ABC', 'DEF', 'GHI'])
print(df5)
print()

df5.rename(columns = lambda x: x.lower().replace(' ', '_'), inplace=True)
print(df5)


      ABC  DEF  GHI
2.5     1    2    3
12.6    4    5    6
4.8     7    8    9
4.8    40   50   60
2.5    23   35   37

      abc  def  ghi
2.5     1    2    3
12.6    4    5    6
4.8     7    8    9
4.8    40   50   60
2.5    23   35   37


## 7 Dealing with Missing Data

Missing data is very common in data science and machine learning. In a large table, some fields often have no data because of difficulties in data collection. Pandas primarily uses the value `np.nan` to represent missing data, and by default it is not included in computations. `DataFrame` has three basic methods to deal with missing data: `DataFrame.dropna()`, `DataFrame.fillna()`, and `DataFrame.isna()`:

In [47]:
df = pd.DataFrame({'x': [1, 2, np.nan, 4]})
df

Unnamed: 0,x
0,1.0
1,2.0
2,
3,4.0


`DataFrame.isna()` creates a boolean `DataFrame` that reports whether each cell is missing data or not.

In [48]:
df1 = df.isna()
print(df1)

       x
0  False
1  False
2   True
3  False


Many Pandas methods omit `NaN` values when performing calculations unless they are explicitly instructed not to:

In [49]:
print(df.mean())
print(df.mean(skipna=False))

x    2.333333
dtype: float64
x   NaN
dtype: float64


`DataFrame.fillna(val)` replaces each missing value with `val`.

In [50]:
df1 = df.fillna(5.0)   # fillna(5.0) creates a new DataFrame object
print(df1)
print()
print(df)

     x
0  1.0
1  2.0
2  5.0
3  4.0

     x
0  1.0
1  2.0
2  NaN
3  4.0


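`DataFrame.fillna(method='ffill')` replaces each `NaN` with the value above it (forward fill) and `DataFrame.fillna(method='bfill')` replaces it with the value below it (backward fill). From pandas 2.1 the `method` argument is deprecated; `DataFrame.ffill()` and `DataFrame.bfill()` are the equivalent calls.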

In [51]:
df.fillna(method='ffill')

Unnamed: 0,x
0,1.0
1,2.0
2,2.0
3,4.0


In [52]:
df.fillna(method='bfill')

Unnamed: 0,x
0,1.0
1,2.0
2,4.0
3,4.0


Another popular option is to apply interpolation and replace missing values with interpolated values. You can do this with `.interpolate()`:

In [53]:
df1 = df.interpolate()
print(df1)

     x
0  1.0
1  2.0
2  3.0
3  4.0


These methods modify the original `DataFrame` in place when called with `inplace=True`:

In [54]:
df.interpolate(inplace=True)  #modify df directly
print(df)

     x
0  1.0
1  2.0
2  3.0
3  4.0
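
`DataFrame.dropna()`, mentioned at the start of this section, removes rows or columns that contain missing values. A minimal sketch on a two-column example made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, np.nan, 4.0],
                   'y': [np.nan, 5.0, 6.0, 7.0]})

rows_kept = df.dropna()               # drop every row that contains a NaN
cols_kept = df.dropna(axis=1)         # drop every column that contains a NaN
x_complete = df.dropna(subset=['x'])  # only look at column 'x' when deciding

print(rows_kept)
print(x_complete)
```

Like `fillna()`, `dropna()` returns a new `DataFrame` unless `inplace=True` is passed.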


## 8 Iterating Over DataFrame Object

As you learned earlier, a `DataFrame`’s row and column labels can be retrieved as sequences with `.index` and `.columns`. You can use this feature to iterate over labels and get or set data values. However, Pandas provides several more convenient methods for iteration:

- `.items()` to iterate over columns
- `.iteritems()` to iterate over columns (an older alias of `.items()` that was removed in pandas 2.0)
- `.iterrows()` to iterate over rows
- `.itertuples()` to iterate over rows and get named tuples

With `.items()`, you iterate over the columns of a Pandas DataFrame. Each iteration yields a tuple with the name of the column and the column data as a `Series` object:



In [55]:
for col_label, col in football.items():
    print(col_label, col, sep='\n', end='\n\n')

year
0    2010
1    2011
2    2012
3    2011
4    2012
5    2010
6    2011
7    2012
Name: year, dtype: int64

team
0      Bears
1      Bears
2      Bears
3    Packers
4    Packers
5      Lions
6      Lions
7      Lions
Name: team, dtype: object

wins
0    11
1     8
2    10
3    15
4    11
5     6
6    10
7     4
Name: wins, dtype: int64

losses
0     5
1     8
2     6
3     1
4     5
5    10
6     6
7    12
Name: losses, dtype: int64



With `.iterrows()`, you iterate over the rows of a Pandas `DataFrame`. Each iteration yields a tuple with the name of the row and the row data as a `Series` object:

In [56]:
for row_label, row in football.iterrows():
    print(row_label, row, sep='\n', end='\n\n')

0
year       2010
team      Bears
wins         11
losses        5
Name: 0, dtype: object

1
year       2011
team      Bears
wins          8
losses        8
Name: 1, dtype: object

2
year       2012
team      Bears
wins         10
losses        6
Name: 2, dtype: object

3
year         2011
team      Packers
wins           15
losses          1
Name: 3, dtype: object

4
year         2012
team      Packers
wins           11
losses          5
Name: 4, dtype: object

5
year       2010
team      Lions
wins          6
losses       10
Name: 5, dtype: object

6
year       2011
team      Lions
wins         10
losses        6
Name: 6, dtype: object

7
year       2012
team      Lions
wins          4
losses       12
Name: 7, dtype: object



Similarly, `.itertuples()` iterates over the rows and in each iteration yields a named tuple with (optionally) the index and data:

In [57]:
# the default tuple name is Pandas; here it is changed to football
# index is set to False, so the tuples produced by the iteration do not include the row index.
for row in football.itertuples(name='football', index=False):   
    print(row)

football(year=2010, team='Bears', wins=11, losses=5)
football(year=2011, team='Bears', wins=8, losses=8)
football(year=2012, team='Bears', wins=10, losses=6)
football(year=2011, team='Packers', wins=15, losses=1)
football(year=2012, team='Packers', wins=11, losses=5)
football(year=2010, team='Lions', wins=6, losses=10)
football(year=2011, team='Lions', wins=10, losses=6)
football(year=2012, team='Lions', wins=4, losses=12)
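
The fields of each named tuple can be read as attributes. The sketch below computes a win ratio row by row with `.itertuples()` and then does the same arithmetic on whole columns, which is usually the more idiomatic choice in Pandas; the `win_ratio` column is made up for illustration.

```python
import pandas as pd

football_small = pd.DataFrame({'team': ['Bears', 'Packers', 'Lions'],
                               'wins': [11, 15, 6],
                               'losses': [5, 1, 10]})

# Row by row: fields of the named tuple are read as attributes
ratios = []
for row in football_small.itertuples(index=False):
    ratios.append(row.wins / (row.wins + row.losses))

# Column-wise: the same arithmetic applied to whole columns at once
football_small['win_ratio'] = football_small['wins'] / (football_small['wins'] + football_small['losses'])
print(football_small)
```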


## 9 How to Save DataFrame and Read Data From File?

Pandas supports many file formats (see below), and a `DataFrame` object can be saved to disk in any of the formats that has a writer. Conversely, data stored in these formats can be read into a `DataFrame` object in memory. These functions are sometimes called the Pandas I/O API:

| Format Type	| Data Description	| Reader 	 | Writer |
| ------------ | ----------------- | ---------  | ------ |
| text	| CSV	| read_csv()	| to_csv() |
| text	| JSON	| read_json()	| to_json() |
| text	| HTML	| read_html()	| to_html() |
| text	| Local clipboard	| read_clipboard()	| to_clipboard() |
| binary	| MS Excel	| read_excel()	| to_excel() |
| binary	| HDF5 Format	| read_hdf()	| to_hdf() |
| binary	| Feather Format	| read_feather()	| to_feather() |
| binary	| Parquet Format	| read_parquet()	| to_parquet() |
| binary | Msgpack (removed in pandas 1.0) | read_msgpack() | to_msgpack() |
| binary	| Stata	| read_stata()	| to_stata() |
| binary	| SAS	| read_sas()	 | |
| binary	| Python Pickle Format	| read_pickle()	| to_pickle() |
| SQL	| SQL	| read_sql()	| to_sql() |
| SQL	| Google Big Query	| read_gbq()	| to_gbq() |

For example, you can save the `football` object into a `CSV` file:

In [58]:
# save the football object to the file football.csv in the current working directory
# set index=False to avoid writing the row index to the CSV file
football.to_csv('football.csv', index=False)   

Or you can read the `football.csv` file just created in the current directory and load its data into a `DataFrame` object:

In [59]:
fb1 = pd.read_csv('football.csv') 

fb1

Unnamed: 0,year,team,wins,losses
0,2010,Bears,11,5
1,2011,Bears,8,8
2,2012,Bears,10,6
3,2011,Packers,15,1
4,2012,Packers,11,5
5,2010,Lions,6,10
6,2011,Lions,10,6
7,2012,Lions,4,12


Since CSV files are widely used to store training data for machine learning models, `pd.read_csv()` is commonly used at the start of a machine learning project. It is important to know that `pd.read_csv()` has many options. Its full signature (as of pandas 1.0) is:

`pandas.read_csv(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal: str = '.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)`

The details about these options can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).
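
A few of these options are used far more often than the rest. The sketch below shows `sep`, `usecols`, `index_col` and `na_values` on a small in-memory example; the file content is hypothetical, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical file content: semicolon-separated, with the word 'missing' for a gap
text = """year;team;wins;losses
2010;Bears;11;5
2011;Bears;8;missing
2012;Bears;10;6
"""

df = pd.read_csv(io.StringIO(text),
                 sep=';',                              # field separator
                 usecols=['year', 'wins', 'losses'],   # read only these columns
                 index_col='year',                     # use this column as the row index
                 na_values=['missing'])                # treat this string as NaN
print(df)
```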
