<center><font size=6 color="#00416d">DataFrame-2</font></center>

In this notebook we learn how to filter a DataFrame to select the rows we need.

In [None]:
import pandas as pd

In [None]:
# Creating a DataFrame
df = pd.read_csv("employees.csv")
df.head(5)

In [None]:
# Get the big picture of the data
df.info()

In [None]:
# Save memory with a few dtype optimizations,
# starting by converting the date strings into datetime objects
df["Start Date"] = pd.to_datetime(df["Start Date"])
df.head(3)

In [None]:
# Convert Last Login Time in the same way.
# The column holds only a time of day, so the date part is filled in with today's date
df["Last Login Time"] = pd.to_datetime(df["Last Login Time"])
df.head(3)

In [None]:
# Similarly, convert some of the other columns to more compact dtypes
df["Gender"] = df["Gender"].astype("category")
df["Senior Management"] = df["Senior Management"].astype("category")
df.head(3)

In [None]:
df.info()

The memory usage reported by df.info() decreased from 62.6 to 49.2.
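
As a rough illustration of why the category conversion saves memory, the sketch below compares the footprint of a column before and after conversion. The Series is made-up data, not the employees file, and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: many rows but only two distinct strings
gender = pd.Series(["Male", "Female"] * 5000)

# deep=True asks pandas to include the memory held by the string values themselves
as_strings = gender.memory_usage(deep=True)
as_category = gender.astype("category").memory_usage(deep=True)

print("strings :", as_strings, "bytes")
print("category:", as_category, "bytes")
```

The saving comes from storing each distinct value once and keeping a small integer code per row, so it is largest when a column has few distinct values.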

In [None]:
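# All of the above can be done in a single read_csv call
# (here Senior Management is read as the nullable boolean dtype instead of category)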
df = pd.read_csv("employees.csv", parse_dates=["Start Date", "Last Login Time"], 
                 dtype={'Gender':"category", "Senior Management":"boolean"})

In [None]:
df.head(5)

In [None]:
df.info()
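
To check what parse_dates and dtype do without needing the file, here is a minimal sketch that reads a tiny in-memory CSV. The two rows and the names in them are invented stand-ins for employees.csv.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for employees.csv
csv_text = (
    "First Name,Gender,Start Date,Senior Management\n"
    "Asha,Female,8/6/1993,true\n"
    "Ravi,Male,3/31/1996,false\n"
)

sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Start Date"],  # parsed into datetime while reading
    dtype={"Gender": "category", "Senior Management": "boolean"},
)
print(sample.dtypes)
```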

In [None]:
# Count the number of null values in a column
df["First Name"].isnull().sum()

### Filter a DataFrame based on a condition

In [None]:
df = pd.read_csv("employees.csv", parse_dates=["Start Date", "Last Login Time"],
                 dtype={'Gender':"category", "Senior Management":"boolean"})
df.head()

In [None]:
# Returns a boolean Series: True where Team is Finance
df["Team"] == "Finance"

In [None]:
# Fetch the rows where the condition is True
df[df["Team"] == "Finance"]

In [None]:
# A boolean column can be used directly as the filter
df[df["Senior Management"]]

In [None]:
# Fetch the rows whose Salary is greater than or equal to 100000
df[df["Salary"] >= 100000]

In [None]:
# We can also apply same logic on dates
df[df["Start Date"] >= "2000-01-01"]
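
The filtering above happens in two steps: a comparison builds a boolean mask, and indexing with that mask keeps the True rows. A small sketch on made-up data (the frame and its values are assumptions, not the employees file):

```python
import pandas as pd

# Hypothetical miniature of the employees data
people = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Salary": [95000, 100000, 120000, 60000],
    "Start Date": pd.to_datetime(
        ["1998-05-01", "2000-01-01", "2005-07-15", "2012-03-09"]
    ),
})

# Step 1: the comparison produces one boolean per row
mask = people["Salary"] >= 100000
# Step 2: indexing with the mask keeps only the True rows
high_earners = people[mask]

# A date string on the right-hand side is compared as a timestamp
recent = people[people["Start Date"] >= "2000-01-01"]

print(high_earners["Name"].tolist())
print(recent["Name"].tolist())
```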

### Fetching all rows that do not match the condition (~)

In [None]:
df = pd.read_csv("employees.csv", parse_dates=["Start Date", "Last Login Time"],
                 dtype={'Gender':"category", "Senior Management":"boolean"})
df.head()

In [None]:
# Fetch all rows whose Gender is not Male
df[~(df["Gender"] == "Male")]
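
One detail worth knowing about the negation: a missing value compares as False, so after ~ it becomes True and the row is kept. The sketch below shows this on a made-up column; combining with notnull() is one possible way to exclude the missing rows.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
gender = pd.Series(["Male", "Female", np.nan, "Male"], dtype="category")

is_male = gender == "Male"   # the missing entry compares as False
not_male = ~is_male          # so it flips to True here

# To leave out the missing entries as well, also require a non-null value
not_male_known = ~is_male & gender.notnull()

print(not_male.tolist())
print(not_male_known.tolist())
```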

### Filtering on multiple conditions using AND - &

In [None]:
df = pd.read_csv("employees.csv", parse_dates=["Start Date", "Last Login Time"],
                 dtype={'Gender':"category", "Senior Management":"boolean"})
df.head()

In [None]:
# Fetch all rows whose Gender is Male and whose Team is Finance
males = df["Gender"] == "Male"
finance = df["Team"] == "Finance"

In [None]:
# A row is fetched only if both conditions are True
df[males & finance]

### Filtering on multiple conditions using OR - |

In [None]:
df = pd.read_csv("employees.csv", parse_dates=["Start Date", "Last Login Time"],
                dtype={'Gender':"category", "Senior Management":"boolean"})
df.head(3)

In [None]:
# Fetch all rows whose Gender is Male or whose Start Date is after 2000-01-01
male = df['Gender'] == "Male"
start_date = df["Start Date"] > "2000-01-01"
df[male | start_date].head(4)

In [None]:
# Now we combine both AND and OR
# Condition-1: Gender is Male and Start Date is later than 2015-01-01
# Condition-2: if Condition-1 is not satisfied, the Team must be Finance
male = df['Gender'] == "Male"
start_date = df["Start Date"] > "2015-01-01"
team = df["Team"] == "Finance"
df[(male & start_date) | team]
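
The masks above are stored in variables before being combined. When the comparisons are written inline instead, each one needs its own parentheses, because & and | bind more tightly than == and > in Python. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical data for illustration
staff = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Team": ["Finance", "Finance", "Sales", "Sales"],
    "Salary": [50000, 120000, 130000, 70000],
})

# Inline form: every comparison is wrapped in parentheses
inline = staff[(staff["Gender"] == "Male") & (staff["Salary"] > 100000)]

# Equivalent form with named masks
male = staff["Gender"] == "Male"
well_paid = staff["Salary"] > 100000
named = staff[male & well_paid]

# AND and OR together: group the AND part explicitly
finance = staff["Team"] == "Finance"
combined = staff[(male & well_paid) | finance]

print(inline.index.tolist(), combined.index.tolist())
```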

### .isin() method
<b>Documentation:</b> <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html">df.isin()</a><br>
<b>Use:</b> Check whether each element in the DataFrame is contained in the given values

In [None]:
df = pd.DataFrame({"America": [4, 6, 8], "India": [1, 2, 3]}, index=["Car", "Bike", "Auto"])
df

In [None]:
df.isin([2, 3])

In [None]:
df = pd.read_csv("employees.csv", parse_dates=["Start Date", "Last Login Time"], 
                 dtype={"Gender":"category", "Senior Management":"boolean", "Team": "category"})
df.head(5)

In [None]:
# Condition-1: fetch all rows whose "Team" column matches "Finance", "Marketing" or "Client Service"
condition = df["Team"].isin(["Finance", "Marketing", "Client Service"])
df[condition]
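
One way to read .isin() on a column is as a shorthand for a chain of == comparisons joined with |. The sketch below compares the two forms on a made-up Series.

```python
import pandas as pd

# Hypothetical Team column, including one missing value
team = pd.Series(["Finance", "Legal", "Marketing", None, "Client Service"])
wanted = ["Finance", "Marketing", "Client Service"]

with_isin = team.isin(wanted)
with_or = (team == "Finance") | (team == "Marketing") | (team == "Client Service")

print(with_isin.tolist())
print(with_or.tolist())
```

The .isin() form stays readable as the list of values grows, which is the main reason to prefer it.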

### The .isnull() and .notnull() methods
<b>.isnull()</b>:<br>
Documentation: <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isnull.html?highlight=isnull">isnull()</a><br>
Use: Detect missing values.<br><br>
<b>.notnull()</b>:<br>
Documentation: <a href="https://pandas.pydata.org/docs/reference/api/pandas.notnull.html">notnull()</a><br>
Use: Detect non-missing values for an array-like object.

In [None]:
df = pd.read_csv("employees.csv", parse_dates=["Start Date", "Last Login Time"], 
                 dtype={"Gender":"category", "Senior Management":"boolean", "Team": "category"})
df.head(5)

In [None]:
# Fetch all rows where the Team column is NaN
df[df["Team"].isnull()]

In [None]:
# Fetch all rows where the Team column is not NaN
df[df["Team"].notnull()]
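
The two filters above split the rows into complementary groups. The sketch below checks that on made-up data, and also shows that dropna with subset= returns the same rows as the notnull() filter.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Team
staff = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Team": ["Finance", np.nan, "Sales"],
})

missing = staff[staff["Team"].isnull()]
present = staff[staff["Team"].notnull()]

# dropna restricted to the Team column keeps the same rows as notnull()
also_present = staff.dropna(subset=["Team"])

print(len(missing), len(present))
```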

### The .between() method
Documentation: <a href="https://pandas.pydata.org/docs/reference/api/pandas.Series.between.html">.between()</a><br>
Use: Return boolean Series equivalent to left <= series <= right.<br>
This function returns a boolean vector containing True wherever the corresponding Series element is between the boundary values left and right. NA values are treated as False.

In [None]:
df = pd.read_csv("employees.csv", parse_dates=["Start Date", "Last Login Time"], 
                 dtype={"Gender":"category", "Senior Management":"boolean", "Team": "category"})
df.head(5)

In [None]:
df[df["Salary"].between(90000, 100000)]

In [None]:
df[df["Start Date"].between("2001-09-21", "2002-09-21")]

In [None]:
df[df["Last Login Time"].between("10:00AM", "12:00PM")]
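
Both endpoints are included by default. The sketch below shows the default next to the inclusive parameter, assuming pandas 1.3 or later, where inclusive accepts "both", "neither", "left" or "right". The salary values are made up.

```python
import pandas as pd

# Hypothetical salaries sitting on and around the boundaries
salary = pd.Series([89999, 90000, 95000, 100000, 100001])

both = salary.between(90000, 100000)                          # endpoints included
neither = salary.between(90000, 100000, inclusive="neither")  # endpoints excluded

# The default is equivalent to two comparisons joined with &
manual = (salary >= 90000) & (salary <= 100000)

print(both.tolist())
print(neither.tolist())
```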

### The .duplicated() method
Return boolean Series denoting duplicate rows.<br>
<b>Documentation:</b> <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html?highlight=duplicate">duplicated()</a>

In [None]:
df = pd.DataFrame({'names':["Kiran", "Kiran", "James", "Kiran", "James", "Madhu"],
                   'ages':[20, 30, 45, 69, 59, 23]})
df.sort_values(by=['names'], inplace=True)
df

In [None]:
# With keep="last", if a value occurs 3 times, the first 2 occurrences are marked True and the last one False
df['names'].duplicated(keep="last")

In [None]:
# With keep=False, every occurrence of a repeated name is marked True
condition = df['names'].duplicated(keep=False)
condition

In [None]:
# Fetch the rows whose name is not duplicated
df[~condition]
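
To compare the three keep options side by side, here is a small sketch using the same names as above, already sorted. The column labels in the summary table are chosen for illustration.

```python
import pandas as pd

names = pd.Series(["James", "James", "Kiran", "Kiran", "Kiran", "Madhu"])

first = names.duplicated()             # keep="first" is the default
last = names.duplicated(keep="last")
every = names.duplicated(keep=False)

# One row per name, one column per keep option
summary = pd.DataFrame({
    "name": names,
    "keep_first": first,
    "keep_last": last,
    "keep_False": every,
})
print(summary)
```

With keep="first" the first occurrence is the one left unmarked, with keep="last" the last one is, and with keep=False none of the repeated values is left unmarked.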

### The .drop_duplicates() method

In [None]:
df = pd.read_csv("employees.csv", parse_dates=["Start Date", "Last Login Time"], 
                 dtype={"Gender":"category", "Senior Management":"boolean", "Team": "category"})
df.sort_values(by=["First Name"], inplace=True)
df.head(5)

In [None]:
# Checking the shape
df.shape

In [None]:
# The shape is unchanged because no two rows are identical in every column
df.drop_duplicates().shape

In [None]:
# Drop rows based on the "First Name" column
# By default it drops every row except the first occurrence;
# set the "keep" parameter to "last" to keep the last occurrence instead
df.drop_duplicates(subset=["First Name"]).shape

In [None]:
# Setting "keep" to False drops every row whose "First Name" is repeated
df.drop_duplicates(subset=["First Name"], keep=False )

In [None]:
# Be careful when using drop_duplicates on categorical columns, because most of their values are repeated
df.drop_duplicates(subset=["Team"], keep=False)

In [None]:
df.head(10)

In [None]:
# We can also pass multiple columns to subset,
# which means that when "First Name" and "Team" both match in more than one row, all of those rows are dropped
df.drop_duplicates(subset=["First Name", "Team"], keep=False).head(10)
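
The effect of the keep parameter is easier to see on a handful of rows. The sketch below uses made-up data and also shows that drop_duplicates returns a new DataFrame, leaving the original unchanged.

```python
import pandas as pd

# Hypothetical data: "Kiran" appears twice
staff = pd.DataFrame({
    "First Name": ["Kiran", "Kiran", "James", "Madhu"],
    "Team": ["Finance", "Sales", "Finance", "Legal"],
})

keep_first = staff.drop_duplicates(subset=["First Name"])              # default
keep_last = staff.drop_duplicates(subset=["First Name"], keep="last")
keep_none = staff.drop_duplicates(subset=["First Name"], keep=False)

print(keep_first.index.tolist())
print(keep_last.index.tolist())
print(keep_none.index.tolist())
print(staff.shape)  # the original still has all four rows
```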

### The .unique() and .nunique() methods

In [None]:
df = pd.read_csv("employees.csv", parse_dates=["Start Date", "Last Login Time"], 
                 dtype={"Gender":"category", "Senior Management":"boolean", "Team": "category"})
df.head()

In [None]:
df["Team"].unique()

In [None]:
# Null values are not counted by default
df["Team"].nunique()

In [None]:
# To include null as a distinct value in the count
df["Team"].nunique(dropna=False)
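
The difference between the three calls is easiest to see on a tiny Series with one missing value. The data below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Team column: two distinct teams plus one missing value
team = pd.Series(["Finance", "Sales", np.nan, "Finance"])

distinct = team.unique()                  # the missing value appears in the result
n_without_null = team.nunique()           # missing value ignored
n_with_null = team.nunique(dropna=False)  # missing value counted once

print(distinct)
print(n_without_null, n_with_null)
```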

## Synopsis

These are only one-line descriptions; go through the examples to understand them in depth.

<table style="margin-left: 0;">
  <tr style="text-align:center;">
    <th>Implementation</th>
    <th>Description</th>
  </tr>
  <tr>
      <td>df[ condition1 & condition2 ]</td>
      <td>It fetches a row if and only if both condition1 and condition2 are true</td>
  </tr>
  <tr>
      <td>df[ condition1 | condition2 ]</td>
      <td>It fetches a row if at least one condition is true</td>
  </tr>
  <tr>
      <td>df.isin([value,...])<br> df["column"].isin([value,...])</td>
      <td>The isin method works on both Series and DataFrame. It returns True where the element is one of the given values</td>
  </tr>
  <tr>
      <td>df["column"].isnull()</td>
      <td>Returns True if value is null else False</td>
  </tr>
  <tr>
      <td>df["column"].notnull()</td>
      <td>Returns True if value is not null else False</td>
  </tr>
  <tr>
      <td>df["column"].between(start, end)</td>
      <td>Returns True for values within the range (start and end included); use it as a filter to fetch those rows</td>
  </tr>
  <tr>
      <td>df["column"].duplicated()</td>
      <td>Suppose a name is repeated thrice: by default (keep="first") the first occurrence is marked False and the other two are marked True.</td>
  </tr>
  <tr>
      <td>df["column"].drop_duplicates()</td>
      <td>It drops the duplicated values, keeping the first occurrence by default</td>
  </tr>
  <tr>
      <td>df["column"].unique()</td>
      <td>Returns all the unique values</td>
  </tr>
  <tr>
      <td>df["column"].nunique()</td>
      <td>Returns the count of unique values (null values are excluded by default)</td>
  </tr>
</table>