In [None]:
#@title Copyright 2020 Google LLC. Double-click here for license information.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Pandas DataFrame UltraQuick Tutorial

This Colab introduces [**DataFrames**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), which are the central data structure in the pandas API. This Colab is not a comprehensive DataFrames tutorial.  Rather, this Colab provides a very quick introduction to the parts of DataFrames required to do the other Colab exercises in Machine Learning Crash Course.

A DataFrame is similar to an in-memory spreadsheet. Like a spreadsheet:

  * A DataFrame stores data in cells.
  * A DataFrame has named columns (usually) and numbered rows.

## Import NumPy and pandas modules

Run the following code cell to import the NumPy and pandas modules.

In [None]:
import numpy as np
import pandas as pd

## Creating a DataFrame

The following code cell creates a simple DataFrame containing 10 cells organized as follows:

  * 5 rows
  * 2 columns, one named `temperature` and the other named `activity`

The following code cell instantiates a `pd.DataFrame` class to generate a DataFrame. The class takes two arguments:

  * The first argument provides the data to populate the 10 cells. The code cell calls `np.array` to generate the 5x2 NumPy array.
  * The second argument identifies the names of the two columns.

**Note**: Do not redefine variables in the following code cell. Subsequent code cells use these variables.

In [None]:
# Create and populate a 5x2 NumPy array.
my_data = np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]])

# Create a Python list that holds the names of the two columns.
my_column_names = ['temperature', 'activity']

# Create a DataFrame.
my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names)

# Print the entire DataFrame
print(my_dataframe)

## Adding a new column to a DataFrame

You may add a new column to an existing pandas DataFrame just by assigning values to a new column name. For example, the following code creates a third column named `adjusted` in `my_dataframe`:

In [None]:
# Create a new column named adjusted.
my_dataframe["adjusted"] = my_dataframe["activity"] + 2

# Print the entire DataFrame
print(my_dataframe)

## Specifying a subset of a DataFrame

pandas provides multiple ways to isolate specific rows, columns, slices, or cells in a DataFrame.

In [None]:
print("Rows #0, #1, and #2:")
print(my_dataframe.head(3), '\n')

print("Row #2:")
print(my_dataframe.iloc[[2]], '\n')

print("Rows #1, #2, and #3:")
print(my_dataframe[1:4], '\n')

print("Column 'temperature':")
print(my_dataframe['temperature'])
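
The cell above isolates rows, slices, and a column, but not a single cell. A minimal sketch of single-cell lookup follows, assuming the same 5x2 data as earlier; the name `demo_df` is hypothetical and is used so that the tutorial's own variables are not redefined.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's 5x2 DataFrame so this sketch runs on its own.
demo_df = pd.DataFrame(
    data=np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]]),
    columns=['temperature', 'activity'])

# Label-based lookup: row label 2, column 'activity'.
cell_by_label = demo_df.at[2, 'activity']

# Position-based lookup: third row, second column.
cell_by_position = demo_df.iat[2, 1]

print(cell_by_label, cell_by_position)
```

Both lookups address the same cell here because the default row labels happen to equal the row positions.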

## Task 1: Create a DataFrame

Do the following:

  1. Create a 3x4 (3 rows x 4 columns) pandas DataFrame in which the columns are named `Eleanor`,  `Chidi`, `Tahani`, and `Jason`.  Populate each of the 12 cells in the DataFrame with a random integer between 0 and 100, inclusive.

  2. Output the following:

     * the entire DataFrame
     * the value in the cell of row #1 of the `Eleanor` column

  3. Create a fifth column named `Janet`, which is populated with the row-by-row sums of `Tahani` and `Jason`.

To complete this task, it helps to know the NumPy basics covered in the NumPy UltraQuick Tutorial.


In [None]:
# Write your code here.

In [None]:
#@title Double-click for a solution to Task 1.

# Create a Python list that holds the names of the four columns.
my_column_names = ['Eleanor', 'Chidi', 'Tahani', 'Jason']

# Create a 3x4 numpy array, each cell populated with a random integer.
my_data = np.random.randint(low=0, high=101, size=(3, 4))

# Create a DataFrame.
df = pd.DataFrame(data=my_data, columns=my_column_names)

# Print the entire DataFrame
print(df)

# Print the value in row #1 of the Eleanor column.
print("\nSecond row of the Eleanor column: %d\n" % df['Eleanor'][1])

# Create a column named Janet whose contents are the sum
# of two other columns.
df['Janet'] = df['Tahani'] + df['Jason']

# Print the enhanced DataFrame
print(df)

## Copying a DataFrame (optional)

Pandas provides two different ways to duplicate a DataFrame:

* **Referencing.** If you assign a DataFrame to a new variable, any change to the DataFrame or to the new variable will be reflected in the other.
* **Copying.** If you call the `pd.DataFrame.copy` method, you create a true independent copy.  Changes to the original DataFrame or to the copy will not be reflected in the other.

The difference is subtle, but important.

In [None]:
# Create a reference by assigning df to a new variable.
print("Experiment with a reference:")
reference_to_df = df

# Print the starting value of a particular cell.
print("  Starting value of df: %d" % df['Jason'][1])
print("  Starting value of reference_to_df: %d\n" % reference_to_df['Jason'][1])

# Modify a cell in df.
df.at[1, 'Jason'] = df['Jason'][1] + 5
print("  Updated df: %d" % df['Jason'][1])
print("  Updated reference_to_df: %d\n\n" % reference_to_df['Jason'][1])

# Create a true copy of my_dataframe
print("Experiment with a true copy:")
copy_of_my_dataframe = my_dataframe.copy()

# Print the starting value of a particular cell.
print("  Starting value of my_dataframe: %d" % my_dataframe['activity'][1])
print("  Starting value of copy_of_my_dataframe: %d\n" % copy_of_my_dataframe['activity'][1])

# Modify a cell in my_dataframe.
my_dataframe.at[1, 'activity'] = my_dataframe['activity'][1] + 3
print("  Updated my_dataframe: %d" % my_dataframe['activity'][1])
print("  copy_of_my_dataframe does not get updated: %d" % copy_of_my_dataframe['activity'][1])
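
The difference between a reference and a copy can also be checked directly with Python's `is` operator. The sketch below is self-contained and uses hypothetical names (`original`, `alias`, `independent`) rather than the DataFrames defined above.

```python
import pandas as pd

original = pd.DataFrame({'Jason': [10, 20, 30]})

# Plain assignment binds a second name to the same object.
alias = original

# copy() builds a separate DataFrame with its own data.
independent = original.copy()

# Change one cell through the original name.
original.at[1, 'Jason'] = 99

print(alias is original)            # the two names share one object
print(independent is original)      # the copy is a different object
print(alias.at[1, 'Jason'], independent.at[1, 'Jason'])
```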

In [None]:
from pandas import Series, DataFrame
import numpy as np

In [None]:
series_obj = Series(np.arange(0, 8, 2), index=['row1', 'row2', 'row3', 'row4'])
series_obj

In [None]:
series_obj.iloc[[0,1]]

In [None]:
series_obj['row2']
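
The two cells above mix position-based and label-based access. As a rough sketch, the explicit `.iloc` and `.loc` accessors make the distinction visible; `demo_series` is a hypothetical name that mirrors `series_obj`.

```python
import numpy as np
import pandas as pd

demo_series = pd.Series(np.arange(0, 8, 2),
                        index=['row1', 'row2', 'row3', 'row4'])

# Label-based access uses the index labels.
by_label = demo_series.loc['row2']

# Position-based access uses integer offsets starting at 0.
by_position = demo_series.iloc[1]

# Label slices include the end label; position slices exclude the end offset.
label_slice = demo_series.loc['row1':'row3']
position_slice = demo_series.iloc[0:2]

print(by_label, by_position, len(label_slice), len(position_slice))
```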

In [None]:
df=DataFrame({'City':['Seattle','Chicago','NYC'], 'States':['WA','IL','NY'],'Avg':[100,200,500]})

In [None]:
df

Creating Series from a dictionary.

In [None]:
series_2=Series({'City':['Seattle','Chicago','NYC'], 'States':['WA','IL','NY'],'Avg':[100,200,500]})

In [None]:
series_2

Converting a Series directly into a DataFrame.
1. If no index is specified, the dictionary keys become the index. If an index is specified and a label does not match any dictionary key, that row is filled with NaN.

In [None]:
no_index_df=DataFrame(series_2)
no_index_df

In [None]:
not_matching_index=DataFrame(series_2,index=['City','A','B'])
not_matching_index
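
The NaN behavior is easier to see with scalar values. In the sketch below, `scores` and `realigned` are hypothetical names, and the Series holds plain numbers instead of lists.

```python
import pandas as pd

scores = pd.Series({'City': 1, 'States': 2, 'Avg': 3})

# Passing index= realigns the Series to the requested labels.
# Labels that are absent from the Series ('A' and 'B') get NaN.
realigned = pd.DataFrame(scores, index=['City', 'A', 'B'])

print(realigned)
```

Only the `City` row keeps its value, because it is the only requested label that exists in the Series.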

How do we change the index then?
1. If we assign to the `.index` attribute as below, the dictionary keys are replaced by the new index list. Unlike the `index` argument used in the conversion earlier, the new labels do not need to match the dictionary keys; only the length must match.

In [None]:
dict_df=DataFrame(series_2,index=['City','States','Avg'])
dict_df.index=['Location','Demo','Income']
dict_df

2. If we want to keep the dictionary keys, we need to reset the index first and then assign a new index. `reset_index` moves the old labels into a column, so the new index can hold any values.

In [None]:
key_and_new=DataFrame(series_2)
key_and_new
#before

In [None]:
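# After: reset_index moves the old index labels into a new column named 'index'.
# With inplace=True, key_and_new itself is modified; without it, reset_index
# returns a new DataFrame and key_and_new keeps its old index.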
key_and_new.reset_index(inplace=True)
key_and_new

In [None]:
#change index
key_and_new.index=['Region','Upper region','Income']
key_and_new

In [None]:
key_and_new.index
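
The reset-then-relabel steps can be sketched on a small frame. The names `frame` and `moved` are hypothetical, and the sketch uses the return value of `reset_index()` instead of `inplace=True`.

```python
import pandas as pd

frame = pd.DataFrame({'value': [1, 2, 3]}, index=['City', 'States', 'Avg'])

# reset_index() returns a new DataFrame; the old labels move into a
# column named 'index' and frame itself is left untouched.
moved = frame.reset_index()

# The new index can now hold any labels, as long as the length matches.
moved.index = ['Region', 'Upper region', 'Income']

print(moved)
```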

Retrieving data from a specific location

In [None]:
df=key_and_new.copy()
df
df.iloc[[1,2],[0]]

Slicing data from the dataframe

In [None]:
df.iloc[0:2,1]

Adding a column, then accessing entire columns using indexing

In [None]:
df['age']=[30,45,60]
df


In [None]:
df.iloc[:,[0,2]]
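
One detail of the `iloc` calls above is worth a short sketch: a scalar column selector returns a Series, while a list selector returns a DataFrame. The frame `demo_df` below is a hypothetical stand-in with ordinary columns.

```python
import pandas as pd

demo_df = pd.DataFrame({'City': ['Seattle', 'Chicago', 'NYC'],
                        'States': ['WA', 'IL', 'NY'],
                        'Avg': [100, 200, 500]})

# Scalar column position: the result collapses to a one-dimensional Series.
as_series = demo_df.iloc[0:2, 1]

# List of column positions: the result stays a two-dimensional DataFrame.
as_frame = demo_df.iloc[0:2, [1]]

print(type(as_series).__name__, type(as_frame).__name__)
```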

Comparing: mask indexing
A comparison on its own, without the outer `df[]`, returns a Series of boolean values, one per index label.
Wrapping that mask in `df[...]` returns the actual rows of the DataFrame that satisfy the condition.

In [None]:
mask=df.iloc[:,2]>35
mask

In [None]:
mask2=df[df.iloc[:,2]>35]
mask2
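
The same two steps can be sketched with a named column, which some readers may find easier to follow than positional `iloc`. The frame `people` and its contents are hypothetical.

```python
import pandas as pd

people = pd.DataFrame({'name': ['a', 'b', 'c'], 'age': [30, 45, 60]})

# Step 1: the comparison alone yields a boolean Series.
age_mask = people['age'] > 35

# Step 2: indexing the frame with that mask keeps only the True rows.
older = people[age_mask]

print(age_mask.tolist())
print(older)
```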

Replacing value inside the dataframe

In [None]:
df

Replacing every value in a column and renaming the columns, versus replacing only the cells that satisfy a condition.

In [None]:
df.iloc[:,1]=None
df

In [None]:
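# df has only three columns here, so add a fourth before assigning four names (avoids a length mismatch error).
df['Outcome']=None
df.columns=['Attribute','Values','Age','Outcome']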
df

Set the value in `Outcome` according to whether `Age` is greater than, smaller than, or equal to 45.
1. Filter on the value of `Age` with `df.iloc[:,2]`, which yields a boolean mask marking the rows that satisfy the condition.
2. Using that mask, access the `Outcome` column for those rows and assign the new value: `df.loc[filtered_rows,'Outcome']`.

In [None]:
# Make sure the Age column holds integers before comparing.
df.iloc[:,2]=df.iloc[:,2].astype(int)
df.loc[df.iloc[:,2]>45,'Outcome']=1
df.loc[df.iloc[:,2]<45,'Outcome']=0
df.loc[df.iloc[:,2]==45,'Outcome']=1000
df
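
As an alternative to three separate `loc` assignments, the same rule can be sketched with NumPy's `np.select`. This is a swapped-in technique, not the approach used above, and the frame `ages` is hypothetical.

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({'Age': [30, 45, 60]})

# Conditions are checked in order; the first match picks the choice.
conditions = [ages['Age'] > 45, ages['Age'] < 45]
choices = [1, 0]

# Rows that match no condition (Age == 45) receive the default.
ages['Outcome'] = np.select(conditions, choices, default=1000)

print(ages)
```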

Modifying the DataFrame to replace `None` values. The code below assigns a list of values to several rows at once; make sure the slice `[0:2]` has the same length as the list.

In [None]:
df.iloc[0:2,1]=[200,300]
df

Now we find all null values. Notice that `isnull()` works on the entire DataFrame or on one specific column.
Then we replace them with an average value.

In [None]:
df['Values'].isnull()
df.isnull()

In [None]:
df.fillna(250,inplace=True)
df
# Make sure inplace=True is specified; otherwise fillna returns a new DataFrame and df itself is left unchanged.
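
The fill value 250 above is the average of the two known values. A minimal sketch of computing that average instead of hard-coding it follows; the frame `values` is hypothetical, and the sketch keeps the return value of `fillna` rather than using `inplace=True`.

```python
import numpy as np
import pandas as pd

values = pd.DataFrame({'Values': [200.0, 300.0, np.nan]})

# mean() skips missing values by default.
column_mean = values['Values'].mean()

# Without inplace=True, fillna returns a new DataFrame.
filled = values.fillna({'Values': column_mean})

print(column_mean)
print(filled)
```

The original `values` frame still contains its NaN after this call, which is the behavior the comment in the cell above warns about.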

What if we want to fill multiple positions with multiple values?

In [None]:
# Start from a copy of df and add two columns that contain missing values.
missing_df=df.copy()
missing_df['Homevalue']=None,None,None
missing_df['Numrooms']=20, None,None
missing_df

The `value` argument of `fillna()` must be a scalar, a dictionary, a Series, or a DataFrame. When using a dictionary, each key must match a column name. The value can be a scalar or, as here, a nested dictionary whose keys are index labels (such as `Region`) and whose values are the fill values. Make sure you use `inplace=True` to commit the change. `fillna()` does not overwrite existing data that is not NaN: `Numrooms` for `Region` remains 20 instead of 1.

In [None]:
missing_df.fillna({'Homevalue':100000,'Numrooms':{'Region':1 , 'Upper region':2,'Income':3}},inplace=True)
missing_df

Let's see what happens if the nested dictionary does not match the index of the DataFrame: the value stays missing for any label that has no match.

In [None]:
second_missing=df.copy()
second_missing['Homevalue']=None,None,None
second_missing['Numrooms']=20, None,None
second_missing.fillna({'Homevalue':100000,'Numrooms':{'Region':1 , 'Home':2,'Income':3}},inplace=True)
second_missing
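
The label-matching behavior can also be sketched one column at a time, using a Series as the fill value. The names `homes` and `per_row_rooms` are hypothetical, and filling column by column is an assumption made for clarity rather than the approach in the cells above.

```python
import numpy as np
import pandas as pd

homes = pd.DataFrame(
    {'Homevalue': [np.nan, np.nan, np.nan],
     'Numrooms': [20, np.nan, np.nan]},
    index=['Region', 'Upper region', 'Income'])

# Fill values keyed by index label; 'Home' matches no row in homes.
per_row_rooms = pd.Series({'Region': 1, 'Home': 2, 'Income': 3})

filled = homes.copy()
filled['Homevalue'] = filled['Homevalue'].fillna(100000)
# The fill Series is aligned on the index: existing values are kept,
# and rows without a matching label stay missing.
filled['Numrooms'] = filled['Numrooms'].fillna(per_row_rooms)

print(filled)
```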

We can also use `fillna(method='ffill')`, which replaces NaN with the last known value in the column. Newer pandas versions deprecate the `method` argument in favor of calling `ffill()` directly.
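
A short sketch of forward filling, using the `ffill()` method and a hypothetical frame named `readings`:

```python
import numpy as np
import pandas as pd

readings = pd.DataFrame({'Numrooms': [20, np.nan, np.nan],
                         'Homevalue': [np.nan, 5.0, np.nan]})

# Each NaN takes the last known value above it in the same column.
# A NaN with no earlier value (the first Homevalue) stays missing.
forward_filled = readings.ffill()

print(forward_filled)
```
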
What if we want to count the number of missing values, that is, how much data is missing?

In [None]:
count_missing=df.copy()
count_missing['Homevalue']=None,None,None
count_missing['Numrooms']=20, None,None
count_missing

In [None]:
count_missing.isnull().sum()
# It returns the total number of missing values per column.

In [None]:
# Can we count the total missing values per row? To aggregate across the columns of each row, pass axis=1.
count_missing.isnull().sum(axis=1)
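
The per-column and per-row counts can be extended to a grand total and a fraction of missing data. The sketch below assumes a hypothetical frame `demo` shaped like `count_missing`.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'Values': [200.0, 300.0, 250.0],
                     'Homevalue': [np.nan, np.nan, np.nan],
                     'Numrooms': [20, np.nan, np.nan]})

missing_per_column = demo.isnull().sum()        # one count per column
missing_per_row = demo.isnull().sum(axis=1)     # one count per row
total_missing = int(demo.isnull().sum().sum())  # single number for the frame
fraction_missing = demo.isnull().mean()         # share of missing per column

print(missing_per_column.tolist(), missing_per_row.tolist(), total_missing)
```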


Let's drop all the records with null values. The result is an empty DataFrame because every single row has a NaN.

In [None]:
count_missing.dropna(inplace=True)
count_missing

We can also call `dropna(axis=1)`, which drops every column that contains a missing value and keeps only the columns that have none.
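
The row-wise and column-wise variants can be compared side by side. The sketch below uses a hypothetical frame `demo` shaped like `count_missing`, and adds the `how='all'` option as a gentler alternative.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'Values': [200.0, 300.0, 250.0],
                     'Homevalue': [np.nan, np.nan, np.nan],
                     'Numrooms': [20, np.nan, np.nan]})

# Default: drop every row that has at least one missing value.
rows_dropped = demo.dropna()

# axis=1: drop every column that has at least one missing value.
columns_dropped = demo.dropna(axis=1)

# how='all': drop only the columns that are entirely missing.
empty_columns_dropped = demo.dropna(axis=1, how='all')

print(rows_dropped.empty)
print(list(columns_dropped.columns))
print(list(empty_columns_dropped.columns))
```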