# Pandas DataFrame


## What is a DataFrame?

  A DataFrame is a two-dimensional data structure: data is aligned in a tabular fashion in rows and columns. A Pandas DataFrame consists of three principal components: the data, the rows, and the columns.

### Features of DataFrame
  - Columns can be of different types
  - Size is mutable
  - Labeled axes (rows and columns)
  - Can perform arithmetic operations on rows and columns
  
  Pandas is a Python library that provides data structures and data analysis tools.
  
  Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it.

For example, say you want to explore a dataset stored in a CSV on your computer. Pandas will extract the data from that CSV into a DataFrame (a table, basically), then let you answer questions and carry out tasks like:

  - What's the average, median, max, or min of each column?
  - Does column A correlate with column B?
  - What does the distribution of data in column C look like?
  - Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
  - Store the cleaned, transformed data back into a CSV, other file or database  
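
The workflow in the list above can be sketched on a tiny in-memory CSV. The column names `A`, `B`, `C` and all values are made up for illustration, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """A,B,C
1,2,10
2,4,
3,6,30
4,8,40
"""

df = pd.read_csv(io.StringIO(csv_text))

# Summary statistics for each numeric column.
means = df.mean()
medians = df.median()

# Pearson correlation between columns A and B.
corr_ab = df["A"].corr(df["B"])

# Drop rows with missing values, then keep only rows where A > 1.
cleaned = df.dropna()
cleaned = cleaned[cleaned["A"] > 1]

# Serialize the cleaned table as CSV text (pass a path to write a file instead).
csv_out = cleaned.to_csv(index=False)
print(means, medians, corr_ab, csv_out, sep="\n")
```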
  
  
## Creating a Pandas DataFrame

  A Pandas DataFrame can be created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file. It can also be created from lists, a dictionary, a list of dictionaries, and so on.
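
As a minimal sketch of the list-of-dictionaries case, each dictionary below becomes one row and its keys become column names. The third row deliberately omits `Age` to show how a missing key is handled.

```python
import pandas as pd

# Each dictionary is one row; keys become column names.
rows = [
    {"Name": "Goutham", "Age": 32},
    {"Name": "Bhanu", "Age": 35},
    {"Name": "Sridhar"},  # no "Age" key, so that cell becomes NaN
]
df = pd.DataFrame(rows)
print(df)
```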
  
### Creating a DataFrame using a List
  A DataFrame can be created using a single list or a list of lists.

In [1]:
import pandas as pd
 
# list of strings
lst = ['Welcome', 'to', 'Pandas', 'DataFrame']
 
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)

           0
0    Welcome
1         to
2     Pandas
3  DataFrame
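
For the list-of-lists case, a small sketch: each inner list is treated as one row. The column names `Word` and `Position` are chosen here only for illustration.

```python
import pandas as pd

# Each inner list is one row; `columns` names the fields.
rows = [["Welcome", 1], ["to", 2], ["Pandas", 3], ["DataFrame", 4]]
df = pd.DataFrame(rows, columns=["Word", "Position"])
print(df)
```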


### Create a DataFrame from Dict of ndarrays / Lists
  All the ndarrays must be of the same length. If an index is passed, the length of the index should equal the length of the arrays.
If no index is passed, the index defaults to range(n), where n is the array length.

In [14]:
import pandas as pd
data = {'Name':['Goutham', 'Bhanu', 'Sridhar', 'Sandeep'],
        'Age':[32,35,39,30]
       }
df = pd.DataFrame(data, columns = ["Name","Age"])
df

Unnamed: 0,Name,Age
0,Goutham,32
1,Bhanu,35
2,Sridhar,39
3,Sandeep,30
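
The index rule can be checked with a short sketch. The labels `a` to `d` are arbitrary; the point is only that their count must match the number of rows.

```python
import pandas as pd

data = {"Name": ["Goutham", "Bhanu", "Sridhar", "Sandeep"],
        "Age": [32, 35, 39, 30]}

# Four labels for four rows: accepted.
df = pd.DataFrame(data, index=["a", "b", "c", "d"])

# Three labels for four rows: pandas raises ValueError.
try:
    pd.DataFrame(data, index=["a", "b", "c"])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

# Without an explicit index, rows are labelled 0..n-1.
default_index = pd.DataFrame(data).index
print(df, mismatch_error, default_index, sep="\n")
```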


## Shared Methods and Attributes

In [4]:
nba = pd.read_csv("nba.csv")

In [6]:
nba.head(1)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0


In [7]:
nba.tail()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,


In [10]:
nba.index

RangeIndex(start=0, stop=458, step=1)

In [11]:
nba.values

array([['Avery Bradley', 'Boston Celtics', 0.0, ..., 180.0, 'Texas',
        7730337.0],
       ['Jae Crowder', 'Boston Celtics', 99.0, ..., 235.0, 'Marquette',
        6796117.0],
       ['John Holland', 'Boston Celtics', 30.0, ..., 205.0,
        'Boston University', nan],
       ..., 
       ['Tibor Pleiss', 'Utah Jazz', 21.0, ..., 256.0, nan, 2900000.0],
       ['Jeff Withey', 'Utah Jazz', 24.0, ..., 231.0, 'Kansas', 947276.0],
       [nan, nan, nan, ..., nan, nan, nan]], dtype=object)

In [12]:
nba.shape

(458, 9)

In [13]:
nba.dtypes

Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

In [14]:
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


In [ ]:
nba.columns

Index(['Name', 'Team', 'Number', 'Position', 'Age', 'Height', 'Weight',
       'College', 'Salary'],
      dtype='object')

In [9]:
nba.axes

[RangeIndex(start=0, stop=458, step=1),
 Index(['Name', 'Team', 'Number', 'Position', 'Age', 'Height', 'Weight',
        'College', 'Salary'],
       dtype='object')]

In [17]:
nba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457
Data columns (total 9 columns):
Name        457 non-null object
Team        457 non-null object
Number      457 non-null float64
Position    457 non-null object
Age         457 non-null float64
Height      457 non-null object
Weight      457 non-null float64
College     373 non-null object
Salary      446 non-null float64
dtypes: float64(4), object(5)
memory usage: 32.3+ KB


## Select One Column from a `DataFrame`

In [40]:
nba = pd.read_csv("nba.csv")
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
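
A minimal sketch of selecting a single column, using a small frame built from three of the rows shown above so it runs without `nba.csv`. The name `nba_small` is introduced here only for the example.

```python
import pandas as pd

# Three rows copied from the nba.head(3) output.
nba_small = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder", "John Holland"],
    "Team": ["Boston Celtics"] * 3,
    "Salary": [7730337.0, 6796117.0, None],
})

# Bracket syntax works for any column name and returns a Series.
names = nba_small["Name"]

# Attribute syntax works only when the name is a valid Python identifier
# and does not collide with an existing DataFrame attribute.
teams = nba_small.Team

print(type(names))
print(names)
```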


## Select Two or More Columns from a `DataFrame`

In [10]:
nba = pd.read_csv("nba.csv")
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [11]:
nba[["Team", "Name"]].head(3)
nba[["Number", "College"]]
nba[["Salary", "Team", "Name"]].tail()

Unnamed: 0,Salary,Team,Name
453,2433333.0,Utah Jazz,Shelvin Mack
454,900000.0,Utah Jazz,Raul Neto
455,2900000.0,Utah Jazz,Tibor Pleiss
456,947276.0,Utah Jazz,Jeff Withey
457,,,


In [12]:
select = ["Salary", "Team", "Name"]
nba[select].head(3)

Unnamed: 0,Salary,Team,Name
0,7730337.0,Boston Celtics,Avery Bradley
1,6796117.0,Boston Celtics,Jae Crowder
2,,Boston Celtics,John Holland


## Add New Column to `DataFrame`

In [14]:
nba = pd.read_csv("nba.csv")
nba.head(3)

# Assigning a scalar to a new label appends a column at the end
nba["Sport"] = "Basketball"
nba.head(3)

nba["League"] = "National Basketball Association"
nba.head(3)

# Reload the data, then use .insert() to place a column at a chosen position
nba = pd.read_csv("nba.csv")
nba.head(3)

nba.insert(3, column = "Sport", value = "Basketball")
nba.head(3)

nba.insert(7, column = "League", value = "National Basketball Association")
nba.head(5)

Unnamed: 0,Name,Team,Number,Sport,Position,Age,Height,League,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,Basketball,PG,25.0,6-2,National Basketball Association,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,Basketball,SF,25.0,6-6,National Basketball Association,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,Basketball,SG,27.0,6-5,National Basketball Association,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,Basketball,SG,22.0,6-5,National Basketball Association,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,Basketball,PF,29.0,6-10,National Basketball Association,231.0,,5000000.0


## Broadcasting Operations

In [103]:
nba = pd.read_csv("nba.csv")
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [114]:
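# Each method has an equivalent operator; both return a new Series
nba["Age"].add(5)
nba["Age"] + 5

nba["Salary"].sub(5000000)
nba["Salary"] - 5000000

# Convert pounds to kilograms and store the result as a new column
nba["Weight"].mul(0.453592)
nba["Weight in Kilograms"] = nba["Weight"] * 0.453592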

In [115]:
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Weight in Kilograms
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,81.64656
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,106.59412
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,92.98636


In [119]:
nba["Salary"].div(1000000)
nba["Salary in Millions"] = nba["Salary"] / 1000000

In [120]:
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Weight in Kilograms,Salary in Millions
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,81.64656,7.730337
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,106.59412,6.796117
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,92.98636,
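
A small self-contained sketch of the same idea: the scalar is applied to every element, the method and operator forms agree, and missing values stay missing. The three weights are taken from the rows above, with the last one replaced by `NaN` for illustration.

```python
import numpy as np
import pandas as pd

weights = pd.Series([180.0, 235.0, np.nan], name="Weight")

# The method and the operator give the same result.
by_method = weights.mul(0.453592)
by_operator = weights * 0.453592

# Missing values propagate: NaN times a number is still NaN.
print(by_operator)
```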


## A Review of the `.value_counts()` Method

In [122]:
nba = pd.read_csv("nba.csv")
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [16]:
nba["College"].value_counts()


Kentucky               22
Duke                   20
Kansas                 18
North Carolina         16
UCLA                   15
Arizona                13
Florida                10
Texas                   9
Syracuse                8
Connecticut             7
USC                     7
Washington              7
LSU                     6
Michigan                6
Georgetown              6
Michigan State          6
Wake Forest             6
Georgia Tech            6
Marquette               5
Stanford                5
Ohio State              5
Wisconsin               5
Indiana                 4
UNLV                    4
Oklahoma State          4
Villanova               4
Tennessee               4
St. John's              3
Maryland                3
Utah                    3
                       ..
Wichita State           1
Old Dominion            1
Weber State             1
Xavier                  1
Georgia State           1
Boston University       1
Louisiana Tech          1
Louisiana-La
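
Two optional parameters of `.value_counts()` are worth knowing; the sketch below uses a short made-up Series rather than the full `College` column.

```python
import numpy as np
import pandas as pd

college = pd.Series(["Kentucky", "Duke", "Kentucky", np.nan, "Kentucky", "Duke"])

counts = college.value_counts()                 # NaN is excluded by default
with_nan = college.value_counts(dropna=False)   # count NaN as its own entry
shares = college.value_counts(normalize=True)   # fractions instead of counts

print(counts, with_nan, shares, sep="\n\n")
```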

## Drop Rows with Null Values

In [17]:
nba = pd.read_csv("nba.csv")
nba.tail(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,


In [19]:
nba.dropna(how = "all", inplace = True)

In [20]:
nba.tail(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


In [23]:
nba.dropna(subset = ["Salary", "College"])
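
The difference between the `how` and `subset` options can be seen on a three-row sketch. The rows are modelled on the tail of the NBA data, with only three columns kept for brevity.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Raul Neto", "Jeff Withey", np.nan],
    "College": [np.nan, "Kansas", np.nan],
    "Salary": [900000.0, 947276.0, np.nan],
})

any_missing = df.dropna()                   # default how="any": drop rows with any NaN
all_missing = df.dropna(how="all")          # drop only rows that are entirely NaN
by_subset = df.dropna(subset=["College"])   # consider only the College column

print(len(any_missing), len(all_missing), len(by_subset))
```

None of these calls modifies `df`; each returns a new DataFrame unless `inplace = True` is passed.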

## Fill in Null Values with the `.fillna()` Method

In [20]:
nba = pd.read_csv("nba.csv")
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [21]:
nba.fillna(0)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,0,5000000.0
5,Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,0,12000000.0
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0
8,Terry Rozier,Boston Celtics,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
9,Marcus Smart,Boston Celtics,36.0,PG,22.0,6-4,220.0,Oklahoma State,3431040.0


In [23]:
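# Assign the result back; inplace = True on a selected column does not update nba under Copy-on-Write
nba["Salary"] = nba["Salary"].fillna(0)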

In [24]:
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


In [27]:
nba["College"] = nba["College"].fillna("No College")

In [28]:
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,No College,5000000.0
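
As an alternative sketch, `.fillna()` also accepts a dictionary that maps each column to its own replacement value, which handles both columns in one call. The two-row frame is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "College": ["Texas", np.nan],
    "Salary": [7730337.0, np.nan],
})

# A dict maps each column name to the value used for its missing cells.
filled = df.fillna({"Salary": 0, "College": "No College"})
print(filled)
```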


## The `.astype()` Method

In [161]:
nba = pd.read_csv("nba.csv").dropna(how = "all")
nba["Salary"] = nba["Salary"].fillna(0)
nba["College"] = nba["College"].fillna("None")
nba.head(6)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
5,Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,,12000000.0


In [188]:
nba.dtypes
nba.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 457 entries, 0 to 456
Data columns (total 9 columns):
Name        457 non-null object
Team        457 non-null category
Number      457 non-null int64
Position    457 non-null category
Age         457 non-null int64
Height      457 non-null object
Weight      457 non-null float64
College     457 non-null object
Salary      457 non-null int64
dtypes: category(2), float64(1), int64(3), object(3)
memory usage: 29.7+ KB


In [166]:
nba["Salary"] = nba["Salary"].astype("int")

In [167]:
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0


In [173]:
nba["Number"] = nba["Number"].astype("int")
nba["Age"] = nba["Age"].astype("int")
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0,PG,25,6-2,180.0,Texas,7730337
1,Jae Crowder,Boston Celtics,99,SF,25,6-6,235.0,Marquette,6796117
2,John Holland,Boston Celtics,30,SG,27,6-5,205.0,Boston University,0


In [175]:
nba["Age"].astype("float")

0      25.0
1      25.0
2      27.0
3      22.0
4      29.0
5      29.0
6      21.0
7      25.0
8      22.0
9      22.0
10     24.0
11     27.0
12     27.0
13     20.0
14     26.0
15     27.0
16     24.0
17     28.0
18     21.0
19     32.0
20     22.0
21     26.0
22     23.0
23     28.0
24     21.0
25     26.0
26     25.0
27     26.0
28     28.0
29     27.0
       ... 
427    20.0
428    25.0
429    23.0
430    24.0
431    27.0
432    23.0
433    28.0
434    34.0
435    24.0
436    25.0
437    24.0
438    23.0
439    26.0
440    30.0
441    20.0
442    28.0
443    23.0
444    24.0
445    20.0
446    24.0
447    23.0
448    26.0
449    23.0
450    28.0
451    26.0
452    20.0
453    26.0
454    24.0
455    26.0
456    26.0
Name: Age, dtype: float64

In [178]:
nba["Position"].nunique()

5

In [181]:
nba["Position"] = nba["Position"].astype("category")

In [186]:
nba["Team"] = nba["Team"].astype("category")

In [187]:
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0,PG,25,6-2,180.0,Texas,7730337
1,Jae Crowder,Boston Celtics,99,SF,25,6-6,235.0,Marquette,6796117
2,John Holland,Boston Celtics,30,SG,27,6-5,205.0,Boston University,0
3,R.J. Hunter,Boston Celtics,28,SG,22,6-5,185.0,Georgia State,1148640
4,Jonas Jerebko,Boston Celtics,8,PF,29,6-10,231.0,,5000000
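
Two points behind the conversions above can be sketched in isolation: a float column that still contains `NaN` cannot be cast to a plain integer, and a `category` column usually needs less memory when few distinct values repeat many times. The data below is synthetic.

```python
import numpy as np
import pandas as pd

ages = pd.Series([25.0, 27.0, np.nan])

# NaN cannot be represented as a plain integer, so this cast raises.
try:
    ages.astype("int")
    cast_failed = False
except ValueError:
    cast_failed = True

# After filling the gap the cast succeeds.
ages_int = ages.fillna(0).astype("int")

# A category column stores each distinct label once plus small integer codes.
teams = pd.Series(["Boston Celtics", "Utah Jazz"] * 500)
as_object = teams.memory_usage(deep=True)
as_category = teams.astype("category").memory_usage(deep=True)

print(cast_failed, ages_int.tolist(), as_object, as_category)
```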


## Sort a `DataFrame` with the `.sort_values()` Method, Part I

In [2]:
nba = pd.read_csv("nba.csv")
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [15]:
nba.sort_values("Salary", ascending = False, na_position = "first").tail()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
175,Jordan McRae,Cleveland Cavaliers,12.0,SG,25.0,6-5,179.0,Tennessee,111196.0
135,Alan Williams,Phoenix Suns,15.0,C,23.0,6-8,260.0,UC Santa Barbara,83397.0
291,Orlando Johnson,New Orleans Pelicans,0.0,SG,27.0,6-5,220.0,UC Santa Barbara,55722.0
130,Phil Pressey,Phoenix Suns,25.0,PG,25.0,5-11,175.0,Missouri,55722.0
32,Thanasis Antetokounmpo,New York Knicks,43.0,SF,23.0,6-7,205.0,,30888.0
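
`.sort_values()` also accepts a list of columns with a matching list of sort directions. A minimal sketch on four made-up rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Team": ["Utah Jazz", "Boston Celtics", "Utah Jazz", "Boston Celtics"],
    "Salary": [900000.0, np.nan, 2900000.0, 7730337.0],
})

# Team A-Z, then Salary high-to-low within each team.
# na_position defaults to "last", so a missing salary ends up last within its team.
ordered = df.sort_values(["Team", "Salary"], ascending=[True, False])
print(ordered)
```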


## Filter a `DataFrame`

In [26]:
import pandas as pd

In [27]:
df = pd.read_csv("employees.csv", parse_dates = ["Start Date", "Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2020-02-13 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2020-02-13 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2020-02-13 11:17:00,130590,11.858,False,Finance


In [29]:
mask1 = df["Gender"] == "Male"
mask2 = df["Team"] == "Marketing"

df[mask1 & mask2].head(5)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2020-02-13 12:42:00,97308,6.945,True,Marketing
21,Matthew,Male,1995-09-05,2020-02-13 02:12:00,100612,13.645,False,Marketing
26,Craig,Male,2000-02-27,2020-02-13 07:45:00,37598,7.757,True,Marketing
74,Thomas,Male,1995-06-04,2020-02-13 14:24:00,62096,17.029,False,Marketing
77,Charles,Male,2004-09-14,2020-02-13 20:13:00,107391,1.26,True,Marketing
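
Besides `&`, boolean masks can be combined with `|` (or) and negated with `~` (not). The sketch below uses a four-row frame modelled on the employees data; the variable names are chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "First Name": ["Douglas", "Thomas", "Maria", "Angela"],
    "Gender": ["Male", "Male", "Female", "Female"],
    "Team": ["Marketing", None, "Finance", "Engineering"],
})

is_male = df["Gender"] == "Male"
in_marketing = df["Team"] == "Marketing"

either = df[is_male | in_marketing]       # OR: at least one condition holds
neither = df[~(is_male | in_marketing)]   # NOT: negate a whole mask

# Inline conditions need parentheses because & and | bind tighter than ==.
inline = df[(df["Gender"] == "Male") & (df["Team"] == "Marketing")]
print(either, neither, inline, sep="\n\n")
```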


## The `.isin()`, `.isnull()`, `.notnull()` and `.between()` Methods

In [33]:
filter1 = df["Gender"].isin(["Female"])
filter2 = df["Team"].isin(["Engineering", "Distribution", "Finance"])
df[filter1 & filter2].head(5)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
2,Maria,Female,1993-04-23,2020-02-13 11:17:00,130590,11.858,False,Finance
7,,Female,2015-07-20,2020-02-13 10:43:00,45906,11.598,True,Finance
8,Angela,Female,2005-11-22,2020-02-13 06:29:00,95570,18.523,True,Engineering
14,Kimberly,Female,1999-01-14,2020-02-13 07:13:00,41426,14.543,True,Finance
30,Christina,Female,2002-08-06,2020-02-13 13:19:00,118780,9.096,True,Engineering


In [35]:
mask = df["Team"].isnull()

df[mask]

In [37]:
condition = df["Gender"].notnull()

df[condition]
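
The `.between()` method named in the heading builds a mask for values inside a range. A minimal sketch using three salaries from the employees output; the bounds 60000 and 100000 are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "First Name": ["Douglas", "Thomas", "Maria"],
    "Salary": [97308, 61933, 130590],
})

# Both endpoints are included by default.
mid_range = df[df["Salary"].between(60000, 100000)]
print(mid_range)
```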

## The `.drop_duplicates()`, `.unique()` and `.nunique()` Methods

In [38]:
df = pd.read_csv("employees.csv", parse_dates = ["Start Date", "Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.sort_values("First Name", inplace = True)
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
101,Aaron,Male,2012-02-17,2020-02-13 10:20:00,61602,11.849,True,Marketing
327,Aaron,Male,1994-01-29,2020-02-13 18:48:00,58755,5.097,True,Marketing
440,Aaron,Male,1990-07-22,2020-02-13 14:53:00,52119,11.343,True,Client Services


In [39]:
df.drop_duplicates(subset = ["First Name"], keep = False)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
8,Angela,Female,2005-11-22,2020-02-13 06:29:00,95570,18.523,True,Engineering
688,Brian,Male,2007-04-07,2020-02-13 22:47:00,93901,17.821,True,Legal
190,Carol,Female,1996-03-19,2020-02-13 03:39:00,57783,9.129,False,Finance
887,David,Male,2009-12-05,2020-02-13 08:48:00,92242,15.407,False,Legal
5,Dennis,Male,1987-04-18,2020-02-13 01:35:00,115163,10.125,False,Legal
495,Eugene,Male,1984-05-24,2020-02-13 10:54:00,81077,2.117,False,Sales
33,Jean,Female,1993-12-18,2020-02-13 09:07:00,119082,16.18,False,Business Development
832,Keith,Male,2003-02-12,2020-02-13 15:02:00,120672,19.467,False,Legal
291,Tammy,Female,1984-11-11,2020-02-13 10:30:00,132839,17.463,True,Client Services


In [41]:
df["Gender"].unique()

[Male, NaN, Female]
Categories (2, object): [Male, Female]

In [46]:
len(df["Team"].unique())

11

In [47]:
df["Team"].nunique(dropna = False)

11
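
The `keep` parameter of `.drop_duplicates()` decides which of the repeated rows survive. A sketch on five made-up rows shows the three options side by side.

```python
import pandas as pd

df = pd.DataFrame({
    "First Name": ["Aaron", "Aaron", "Angela", "Brian", "Aaron"],
    "Team": ["Marketing", "Marketing", "Engineering", "Legal", "Client Services"],
})

first = df.drop_duplicates(subset=["First Name"])                # keep="first" is the default
last = df.drop_duplicates(subset=["First Name"], keep="last")    # keep the final occurrence
unique_only = df.drop_duplicates(subset=["First Name"], keep=False)  # drop every repeated name

print(first.index.tolist(), last.index.tolist(), unique_only.index.tolist())
```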

## Working with Text Data

In [48]:
import pandas as pd 

In [49]:
chicago = pd.read_csv("chicago.csv")
chicago["Department"] = chicago["Department"].astype("category")
chicago.head(3)

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00


In [50]:
chicago["Department"].nunique()

35

In [51]:
chicago["Department"].count()

32062

### `.str.lower()`, `.str.upper()`, `.str.title()` and `.str.len()` methods

In [55]:
chicago["Department"].str.lower()
chicago["Department"].str.upper()
chicago["Department"].str.title()
chicago["Department"].str.len()

### `.str.replace()`, `.str.strip()`, `.str.lstrip()`, `.str.rstrip()` and `.str.split()` methods

In [57]:
chicago["Name"].str.replace(",","")

In [59]:
chicago["Name"].str.strip()

In [None]:
chicago["Name"].str.split(",").str.get(0)
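
The string methods can be chained to clean a column. The sketch below rebuilds two rows from the `chicago.head(3)` output so it runs without `chicago.csv`; `chicago_small` and the derived variable names are introduced only for this example.

```python
import pandas as pd

chicago_small = pd.DataFrame({
    "Name": ["AARON, ELVIA J", "AARON, JEFFERY M"],
    "Employee Annual Salary": ["$90744.00", "$84450.00"],
})

# regex=False treats "$" as a literal character, not a regex anchor.
salary = (chicago_small["Employee Annual Salary"]
          .str.replace("$", "", regex=False)
          .astype(float))

# Split "LAST, FIRST" on the comma, then tidy each piece.
parts = chicago_small["Name"].str.split(",")
last_name = parts.str.get(0).str.title()
first_name = parts.str.get(1).str.strip().str.title()

print(salary.tolist(), last_name.tolist(), first_name.tolist())
```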