# Pandas DataFrame UltraQuick Tutorial

This Colab introduces [**DataFrames**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), the central data structure in the pandas API. It is not a comprehensive DataFrames tutorial. Rather, it provides a very quick introduction to the parts of DataFrames required to do the other Colab exercises in Machine Learning Crash Course.

A DataFrame is similar to an in-memory spreadsheet. Like a spreadsheet:

  * A DataFrame stores data in cells. 
  * A DataFrame has named columns (usually) and numbered rows.
  
Corresponding Colab notebook: https://colab.research.google.com/github/google/eng-edu/blob/master/ml/cc/exercises/pandas_dataframe_ultraquick_tutorial.ipynb

In [1]:
import pandas as pd 
import numpy as np


### Create a DataFrame

In [2]:
# Create and populate a 5x2 NumPy array.
my_data = np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]])

# Create a Python list that holds the names of the two columns.
my_column_names = ['temperature', 'activity']

# Create a DataFrame.
my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names)

# Print the entire DataFrame
print(my_dataframe)

   temperature  activity
0            0         3
1           10         7
2           20         9
3           30        14
4           40        15
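
The cell above builds the DataFrame from a NumPy array plus a list of column names. As a minimal sketch of an alternative, the same table can be built from a Python dictionary that maps each column name to its values. The variable names `from_array` and `from_dict` are illustrative only and do not appear in the original Colab.

```python
import numpy as np
import pandas as pd

# Same data as the cell above, built from a NumPy array and a list of names.
my_data = np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]])
from_array = pd.DataFrame(data=my_data, columns=['temperature', 'activity'])

# Alternative: a dict whose keys become column names (illustrative sketch).
from_dict = pd.DataFrame({
    'temperature': [0, 10, 20, 30, 40],
    'activity': [3, 7, 9, 14, 15],
})

# Compare the underlying values; integer dtypes may differ by platform,
# so the raw values are compared rather than the DataFrames themselves.
same_values = (from_array.to_numpy() == from_dict.to_numpy()).all()
print(from_dict.shape)
print(same_values)
```

The dictionary form keeps each column's name next to its values, which can be easier to read when the columns come from separate sources.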


### Adding a new column to a DataFrame

In [3]:
# Create a new column named adjusted.
my_dataframe["adjusted"] = my_dataframe["activity"] + 2

# Print the entire DataFrame
print(my_dataframe)

   temperature  activity  adjusted
0            0         3         5
1           10         7         9
2           20         9        11
3           30        14        16
4           40        15        17
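
Assigning to `my_dataframe["adjusted"]` modifies the DataFrame in place. As a hedged sketch, `DataFrame.assign` is a related pandas method that returns a new DataFrame with the extra column and leaves the original untouched. The column name `double_activity` is hypothetical and used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'temperature': [0, 10, 20, 30, 40],
    'activity': [3, 7, 9, 14, 15],
})

# In-place: adds the column to df itself, as in the cell above.
df['adjusted'] = df['activity'] + 2

# Not in place: assign() returns a new DataFrame with the extra column.
with_double = df.assign(double_activity=df['activity'] * 2)

print(list(df.columns))
print(list(with_double.columns))
```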


### Specifying a subset of a DataFrame

In [4]:
print("Rows #0, #1, and #2:")
print(my_dataframe.head(3), '\n')

print("Row #2:")
print(my_dataframe.iloc[[2]], '\n')

print("Rows #1, #2, and #3:")
print(my_dataframe[1:4], '\n')

print("Column 'temperature':")
print(my_dataframe['temperature'])

Rows #0, #1, and #2:
   temperature  activity  adjusted
0            0         3         5
1           10         7         9
2           20         9        11 

Row #2:
   temperature  activity  adjusted
2           20         9        11 

Rows #1, #2, and #3:
   temperature  activity  adjusted
1           10         7         9
2           20         9        11
3           30        14        16 

Column 'temperature':
0     0
1    10
2    20
3    30
4    40
Name: temperature, dtype: int32
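
The cell above selects rows with `head`, `iloc`, and slicing, and a column by name. The following sketch shows a few related selections that are not part of the original Colab: label-based slicing with `loc`, a boolean mask, and a single-cell lookup with `at`. It assumes the default integer index, where row labels and row positions coincide.

```python
import pandas as pd

df = pd.DataFrame({
    'temperature': [0, 10, 20, 30, 40],
    'activity': [3, 7, 9, 14, 15],
})

# Position-based slice: the end position (4) is excluded.
by_position = df.iloc[1:4]

# Label-based slice: the end label (3) is included.
by_label = df.loc[1:3]

# Boolean mask: keep only rows where temperature is at least 20.
warm = df[df['temperature'] >= 20]

# Single cell by row label and column name.
cell = df.at[2, 'activity']

print(by_position.equals(by_label))  # True with the default index
print(list(warm.index))              # [2, 3, 4]
print(cell)                          # 9
```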


### Exercise 1 - Do the following:

Create a 3x4 (3 rows x 4 columns) pandas DataFrame in which the columns are named Eleanor, Chidi, Tahani, and Jason. Populate each of the 12 cells in the DataFrame with a random integer between 0 and 100, inclusive.

Output the following:

  * the entire DataFrame
  * the value in the cell of row #1 of the Eleanor column

Then create a fifth column named Janet, which is populated with the row-by-row sums of Tahani and Jason.

To complete this task, it helps to know the NumPy basics covered in the NumPy UltraQuick Tutorial.

In [5]:
my_columns1 = ['Eleanor', 'Chidi', 'Tahani', 'Jason']

my_data1 = np.random.randint(0, 101, size=(3,4))

# Create a DataFrame.
my_dataframe1 = pd.DataFrame(data=my_data1, columns=my_columns1)
# the entire DataFrame
print(my_dataframe1)

# the value in the cell of row #1 of the Eleanor column

print(my_dataframe1['Eleanor'][1])

# Create a fifth column named Janet, which is populated with the row-by-row sums of Tahani and Jason.
my_dataframe1['Janet'] = my_dataframe1['Tahani'] + my_dataframe1['Jason'] 
print(my_dataframe1)

   Eleanor  Chidi  Tahani  Jason
0       68     16      60      8
1       41     58      92     64
2       31     53      21     26
41
   Eleanor  Chidi  Tahani  Jason  Janet
0       68     16      60      8     68
1       41     58      92     64    156
2       31     53      21     26     47
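
Because `np.random.randint` excludes its upper bound, the solution passes 101 to make 100 a possible value. The output above will also differ on every run. As a sketch under stated assumptions, NumPy's newer `Generator` API can make the result reproducible with a seed and can include the upper bound directly via `endpoint=True`. The seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# A seeded generator yields the same numbers on every run (seed is arbitrary).
rng = np.random.default_rng(seed=42)

# endpoint=True makes the upper bound inclusive, so values fall in [0, 100].
data = rng.integers(low=0, high=100, size=(3, 4), endpoint=True)

df = pd.DataFrame(data=data, columns=['Eleanor', 'Chidi', 'Tahani', 'Jason'])
df['Janet'] = df['Tahani'] + df['Jason']

print(df)
print(df.at[1, 'Eleanor'])
```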


## Copying a DataFrame (optional)

Pandas provides two different ways to duplicate a DataFrame:

* **Referencing.** If you assign a DataFrame to a new variable, both names refer to the same underlying object, so any change made through one name is visible through the other.
* **Copying.** If you call the `pd.DataFrame.copy` method, you create a true independent copy.  Changes to the original DataFrame or to the copy will not be reflected in the other. 

The difference is subtle, but important.

In [6]:
# Create a reference by assigning my_dataframe1 to a new variable.
print("Experiment with a reference:")
reference_to_df = my_dataframe1

# Print the starting value of a particular cell.
print("  Starting value of df: %d" % my_dataframe1['Jason'][1])
print("  Starting value of reference_to_df: %d\n" % reference_to_df['Jason'][1])

# Modify a cell in my_dataframe1.
my_dataframe1.at[1, 'Jason'] = my_dataframe1['Jason'][1] + 5
print("  Updated df: %d" % my_dataframe1['Jason'][1])
print("  Updated reference_to_df: %d\n\n" % reference_to_df['Jason'][1])

# Create a true copy of my_dataframe
print("Experiment with a true copy:")
copy_of_my_dataframe = my_dataframe.copy()

# Print the starting value of a particular cell.
print("  Starting value of my_dataframe: %d" % my_dataframe['activity'][1])
print("  Starting value of copy_of_my_dataframe: %d\n" % copy_of_my_dataframe['activity'][1])

# Modify a cell in my_dataframe.
my_dataframe.at[1, 'activity'] = my_dataframe['activity'][1] + 3
print("  Updated my_dataframe: %d" % my_dataframe['activity'][1])
print("  copy_of_my_dataframe does not get updated: %d" % copy_of_my_dataframe['activity'][1])

Experiment with a reference:
  Starting value of df: 64
  Starting value of reference_to_df: 64

  Updated df: 69
  Updated reference_to_df: 69


Experiment with a true copy:
  Starting value of my_dataframe: 7
  Starting value of copy_of_my_dataframe: 7

  Updated my_dataframe: 10
  copy_of_my_dataframe does not get updated: 7
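
The experiment above can be condensed into a small self-contained sketch. Python's `is` operator makes the distinction explicit: a reference is the same object as the original, whereas the result of `copy()` is a different object. The variable names here are illustrative and not taken from the original Colab.

```python
import pandas as pd

original = pd.DataFrame({'activity': [3, 7, 9]})

reference = original           # second name for the same object
independent = original.copy()  # separate object with its own data

# Modify one cell through the original name.
original.at[1, 'activity'] = 10

print(reference is original)            # True
print(independent is original)          # False
print(reference.at[1, 'activity'])      # 10, the change is visible
print(independent.at[1, 'activity'])    # 7, the copy is unaffected
```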


### Next steps 

  * [Getting started with pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html#getting-started)
  * [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)