# Pandas DataFrame UltraQuick Tutorial 
Practice work from the additional resources section of the Google ML course.
Contains extra steps and pointers from the related pandas documentation.
Link to the original Colab notebook:
https://colab.research.google.com/github/google/eng-edu/blob/master/ml/cc/exercises/pandas_dataframe_ultraquick_tutorial.ipynb?utm_source=mlcc&utm_campaign=colab-external&utm_medium=referral&utm_content=pandas_tf2-colab&hl=en#scrollTo=YDu2VotPgzsW

This notebook introduces DataFrames, which are the central data structure in the pandas API. This is not a comprehensive DataFrames tutorial. Rather, it provides a very quick introduction to the parts of DataFrames required to do the other exercises in Machine Learning Crash Course from Google Education.

A DataFrame is similar to an in-memory spreadsheet. Like a spreadsheet:

A DataFrame stores data in cells.
A DataFrame has named columns (usually) and numbered rows.

# Import NumPy and pandas modules
Run the following code cell to import the NumPy and pandas modules.

In [1]:
import numpy as np
import pandas as pd

# Creating a DataFrame
The following code cell creates a simple DataFrame containing 12 cells organized as follows:

6 rows
2 columns, one named temperature and the other named activity
The following code cell instantiates the pd.DataFrame class to generate a DataFrame. The class takes two arguments:

The first argument provides the data to populate the 12 cells. The code cell calls np.array to generate the 6x2 NumPy array.
The second argument identifies the names of the two columns.

In [13]:
# Create and populate a 6x2 NumPy array.
my_data = np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15], [55, 17]])

# Create a Python list that holds the names of the two columns.
my_column_names = ['temperature', 'activity']

# Create a DataFrame.
my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names)

# Print the entire DataFrame
print(my_dataframe)

   temperature  activity
0            0         3
1           10         7
2           20         9
3           30        14
4           40        15
5           55        17
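
As a quick check on the structure described above, the following sketch rebuilds the same data and inspects its dimensions. The names example_data and example_df are hypothetical, chosen so that my_dataframe is not overwritten.

```python
import numpy as np
import pandas as pd

# Same values as my_data above, under a hypothetical name.
example_data = np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15], [55, 17]])
example_df = pd.DataFrame(data=example_data, columns=['temperature', 'activity'])

# shape is a (rows, columns) tuple; size is the total number of cells.
print(example_df.shape)
print(example_df.size)

# columns holds the column names; index holds the row labels.
print(list(example_df.columns))
print(list(example_df.index))
```

Checking shape right after construction is a cheap way to catch a mismatch between the data and the description of it.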


# Adding a new column to a DataFrame
You may add a new column to an existing pandas DataFrame just by assigning values to a new column name. For example, the following code creates a third column named adjusted in my_dataframe:

In [14]:
# Create a new column named adjusted.
my_dataframe["adjusted"] = my_dataframe["activity"] + 2

# Print the entire DataFrame
print(my_dataframe)

   temperature  activity  adjusted
0            0         3         5
1           10         7         9
2           20         9        11
3           30        14        16
4           40        15        17
5           55        17        19
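
The same assignment pattern also works when the new column is computed from more than one existing column. The sketch below uses a small hypothetical DataFrame (example_df) and a hypothetical column name (total). It also shows DataFrame.assign, which returns a new DataFrame instead of modifying the existing one.

```python
import pandas as pd

# Small hypothetical DataFrame for illustration.
example_df = pd.DataFrame({'temperature': [0, 10, 20], 'activity': [3, 7, 9]})

# In place: assigning to a new column name adds the column to example_df.
example_df['adjusted'] = example_df['activity'] + 2

# assign() returns a new DataFrame and leaves example_df unchanged.
with_total = example_df.assign(total=example_df['temperature'] + example_df['activity'])

print(example_df)
print(with_total)
```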


# Specifying a subset of a DataFrame
pandas provides multiple ways to isolate specific rows, columns, slices, or cells in a DataFrame.

In [15]:
print("Rows #0, #1, and #2:")
print(my_dataframe.head(3),'\n', "------------",'\n')

print("Row #1:")
print(my_dataframe.iloc[[1]], '\n', "------------",'\n','\n')

print("Rows #2, #3, #4 and #5:")
print(my_dataframe[2:6], '\n', "------------",'\n')

print("Column 'temperature':")
print(my_dataframe['temperature'])

Rows #0, #1, and #2:
   temperature  activity  adjusted
0            0         3         5
1           10         7         9
2           20         9        11 
 ------------ 

Row #1:
   temperature  activity  adjusted
1           10         7         9 
 ------------ 
 

Rows #2, #3, #4 and #5:
   temperature  activity  adjusted
2           20         9        11
3           30        14        16
4           40        15        17
5           55        17        19 
 ------------ 

Column 'temperature':
0     0
1    10
2    20
3    30
4    40
5    55
Name: temperature, dtype: int64
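
The cell above covers rows, slices, and columns. A single cell or a filtered set of rows can be selected in a similar way. The following is a minimal sketch on a hypothetical DataFrame (example_df), using the iloc and loc indexers and a boolean mask.

```python
import pandas as pd

# Small hypothetical DataFrame for illustration.
example_df = pd.DataFrame({'temperature': [0, 10, 20, 30], 'activity': [3, 7, 9, 14]})

# Position-based: row at position 1, column at position 0.
cell_by_position = example_df.iloc[1, 0]

# Label-based: row label 1, column label 'activity'.
cell_by_label = example_df.loc[1, 'activity']

# Label-based slices with loc include the end label,
# unlike positional slices such as my_dataframe[2:6].
rows_1_to_2 = example_df.loc[1:2, ['temperature']]

# Boolean mask: keep only the rows where activity exceeds 5.
busy_rows = example_df[example_df['activity'] > 5]

print(cell_by_position, cell_by_label)
print(rows_1_to_2)
print(busy_rows)
```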


# Task 1: Create a DataFrame
Do the following:

Create a 3x4 (3 rows x 4 columns) pandas DataFrame in which the columns are named Eleanor, Chidi, Tahani, and Jason. Populate each of the 12 cells in the DataFrame with a random integer between 0 and 100, inclusive.

Output the following:

the entire DataFrame
the value in the cell of row #1 of the Eleanor column
Create a fifth column named Janet, which is populated with the row-by-row sums of Tahani and Jason.

To complete this task, it helps to know the NumPy basics covered in the NumPy UltraQuick Tutorial.

In [17]:
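# Create and populate a 3x4 NumPy array (fixed values are used here instead of random integers).
mat = np.array([[0, 3, 10, 7], [20, 9, 30, 14], [40, 15, 55, 17]])

# Create a Python list that holds the names of the four columns.
cols = ['Eleanor', 'Chidi', 'Tahani', 'Jason']

# Create a DataFrame.
df = pd.DataFrame(data=mat, columns=cols)

# Print the entire DataFrame.
print(df,'\n','\n')

# Print row #1 of the Eleanor column (a one-element Series).
print(df['Eleanor'].iloc[[1]])

# Create a fifth column named Janet holding the row-by-row sums of Tahani and Jason.
df['Janet'] = df['Tahani'] + df['Jason']
print('\n')
print(df)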

   Eleanor  Chidi  Tahani  Jason
0        0      3      10      7
1       20      9      30     14
2       40     15      55     17 
 

1    20
Name: Eleanor, dtype: int64


   Eleanor  Chidi  Tahani  Jason  Janet
0        0      3      10      7     17
1       20      9      30     14     44
2       40     15      55     17     72
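
The task statement asks for random integers, while the solution above uses fixed values. A possible variant that follows the task more literally is sketched below. It assumes np.random.randint, whose upper bound is exclusive, so high=101 is needed to include 100. The names task_names and task_df are hypothetical.

```python
import numpy as np
import pandas as pd

# Column names from the task statement.
task_names = ['Eleanor', 'Chidi', 'Tahani', 'Jason']

# 3x4 array of random integers from 0 to 100 inclusive (high is exclusive).
task_data = np.random.randint(low=0, high=101, size=(3, 4))
task_df = pd.DataFrame(data=task_data, columns=task_names)

# Print the entire DataFrame.
print(task_df)

# Print the scalar value in row #1 of the Eleanor column.
print(task_df.at[1, 'Eleanor'])

# Add the Janet column as the row-by-row sum of Tahani and Jason.
task_df['Janet'] = task_df['Tahani'] + task_df['Jason']
print(task_df)
```

Because the values are random, the printed numbers differ on every run.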


# Copying a DataFrame (optional)
Pandas provides two different ways to duplicate a DataFrame:

Referencing. If you assign a DataFrame to a new variable, any change to the DataFrame or to the new variable will be reflected in the other.
Copying. If you call the pd.DataFrame.copy method, you create a true independent copy. Changes to the original DataFrame or to the copy will not be reflected in the other.
The difference is subtle, but important.

In [18]:
# Create a reference by assigning df to a new variable.
print("Experiment with a reference:")
reference_to_df = df

# Print the starting value of a particular cell.
print("  Starting value of df: %d" % df['Jason'][1])
print("  Starting value of reference_to_df: %d\n" % reference_to_df['Jason'][1])

# Modify a cell in df.
df.at[1, 'Jason'] = df['Jason'][1] + 5
print("  Updated df: %d" % df['Jason'][1])
print("  Updated reference_to_df: %d\n\n" % reference_to_df['Jason'][1])

# Create a true copy of my_dataframe
print("Experiment with a true copy:")
copy_of_my_dataframe = my_dataframe.copy()

# Print the starting value of a particular cell.
print("  Starting value of my_dataframe: %d" % my_dataframe['activity'][1])
print("  Starting value of copy_of_my_dataframe: %d\n" % copy_of_my_dataframe['activity'][1])

# Modify a cell in my_dataframe.
my_dataframe.at[1, 'activity'] = my_dataframe['activity'][1] + 3
print("  Updated my_dataframe: %d" % my_dataframe['activity'][1])
print("  copy_of_my_dataframe does not get updated: %d" % copy_of_my_dataframe['activity'][1])

Experiment with a reference:
  Starting value of df: 14
  Starting value of reference_to_df: 14

  Updated df: 19
  Updated reference_to_df: 19


Experiment with a true copy:
  Starting value of my_dataframe: 7
  Starting value of copy_of_my_dataframe: 7

  Updated my_dataframe: 10
  copy_of_my_dataframe does not get updated: 7
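
One way to tell a reference from a copy without modifying any data is Python's is operator, which tests whether two names point to the same object. The sketch below uses hypothetical names (original, reference, independent) and is only an illustration of the distinction described above.

```python
import pandas as pd

# Small hypothetical DataFrame for illustration.
original = pd.DataFrame({'activity': [3, 7, 9]})

reference = original           # a second name for the same object
independent = original.copy()  # a separate object with its own data

print(reference is original)
print(independent is original)

# A change made through one name is visible through the reference,
# but not through the independent copy.
original.at[1, 'activity'] = 100
print(reference.at[1, 'activity'])
print(independent.at[1, 'activity'])
```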


# Experimenting with the previously created DataFrame for practice

In [21]:
# Reference: both names point to the same DataFrame.
df_copy_by_ref = df
# True, independent copy.
df_copy = df.copy()
print("Original dataframe:", '\n', df, '\n')
print("Dataframe Copy by Reference:", '\n', df_copy_by_ref, '\n')
print("Dataframe Copy by copy() Method:", '\n', df_copy, '\n')

Original dataframe: 
    Eleanor  Chidi  Tahani  Jason  Janet
0        0      3      10      7     17
1       20      9      30     19     44
2       40     15      55     17     72 

Dataframe Copy by Reference: 
    Eleanor  Chidi  Tahani  Jason  Janet
0        0      3      10      7     17
1       20      9      30     19     44
2       40     15      55     17     72 

Dataframe Copy by copy() Method: 
    Eleanor  Chidi  Tahani  Jason  Janet
0        0      3      10      7     17
1       20      9      30     19     44
2       40     15      55     17     72 



In [23]:
# Add a column to the original DataFrame; the reference sees it, the copy does not.
df['Sam']=df['Eleanor']*(df['Eleanor']-10)
print("Original dataframe:", '\n', df, '\n')
print("Dataframe Copy by Reference:", '\n', df_copy_by_ref, '\n')
print("Dataframe Copy by copy() Method:", '\n', df_copy, '\n')

Original dataframe: 
    Eleanor  Chidi  Tahani  Jason  Janet   Sam
0        0      3      10      7     17     0
1       20      9      30     19     44   200
2       40     15      55     17     72  1200 

Dataframe Copy by Reference: 
    Eleanor  Chidi  Tahani  Jason  Janet   Sam
0        0      3      10      7     17     0
1       20      9      30     19     44   200
2       40     15      55     17     72  1200 

Dataframe Copy by copy() Method: 
    Eleanor  Chidi  Tahani  Jason  Janet
0        0      3      10      7     17
1       20      9      30     19     44
2       40     15      55     17     72 

