# Data management using Pandas

**Data management** is a crucial component of statistical analysis and data science work.

This notebook will show you how to import, view, understand, and manage your data using the [Pandas](http://pandas.pydata.org) data processing library. In particular, it demonstrates how to read a dataset into Python and obtain a basic understanding of its content.

Note that **Python** by itself is a general-purpose programming language and does not provide high-level data processing capabilities.  The **Pandas** library was developed to meet this need. **Pandas** is the most popular Python library for data manipulation, and we will use it extensively in this course. **Pandas** provides high-performance, easy-to-use data structures and data analysis tools.

The main data structure that **Pandas** works with is called a **Data Frame**. This is a two-dimensional table of data in which the rows typically represent cases and the columns represent variables (e.g. the data used in this tutorial).  Pandas also has a one-dimensional data structure called a **Series** that we will encounter when accessing a single column of a Data Frame.

Pandas has a variety of functions named `read_xxx` for reading data in different formats.  Right now we will focus on reading `csv` files, where `csv` stands for comma-separated values. Other supported formats include `excel`, `json`, and `sql`.

There are many other options to `read_csv` that are very useful.  For example, you would use the option `sep='\t'` instead of the default `sep=','` if the fields of your data file are delimited by tabs instead of commas.  See [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for the full documentation for `read_csv`.
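
As a minimal sketch of the `sep` option described above, the following reads tab-delimited text. The content and the names `tsv_text` and `scores` are made up for illustration, and the text is held in memory with `io.StringIO` so that no file on disk is needed.

```python
import io
import pandas as pd

# Hypothetical tab-delimited content, held in memory instead of a file on disk
tsv_text = "name\tscore\nana\t9.5\nluis\t7.0\n"

# sep='\t' tells read_csv that the fields are separated by tabs, not commas
scores = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(scores)
```

With a real file, the first argument would simply be the path to that file.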


## Acknowledgments

- The dataset used in this tutorial is from https://www.coursera.org/ from the course "Understanding and Visualizing Data with Python" by University of Michigan


# Importing libraries


In [71]:
# Import the packages that we will be using
import pandas as pd

#Project route
absoluteRoute = "/Users/benjaminortiz/Documents/GitHub/TC1002S/NotebooksStudents/A01277673/SecondClass/"
routeOfIris = absoluteRoute + "/Data/iris.data"



# Importing data

In [72]:
# Convert to a Pandas DataFrame for easier manipulation
df = pd.read_csv(routeOfIris)
df = df.rename(columns={'5.1':'sepal length (cm)'})
df = df.rename(columns={'3.5':'sepal width (cm)'})
df = df.rename(columns={'1.4':'petal length (cm)'})
df = df.rename(columns={'0.2':'petal width (cm)'})
df = df.rename(columns={'Iris-setosa':'Class'})



To check the type of the resulting object, simply run `type(df)`.
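
The `rename` calls above map names such as `'5.1'` and `'Iris-setosa'` to column names, which suggests that the data file has no header row and that its first observation was used as the header. A possible alternative, sketched here on two sample lines in the same layout (the variable names are illustrative), is to pass `header=None` together with `names=` so that the first line is kept as data.

```python
import io
import pandas as pd

# Two sample lines in the same layout as the data file (no header row)
raw_text = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
column_names = ["sepal length (cm)", "sepal width (cm)",
                "petal length (cm)", "petal width (cm)", "Class"]

# header=None keeps the first line as data; names= supplies the column labels
iris_sample = pd.read_csv(io.StringIO(raw_text), header=None, names=column_names)
print(type(iris_sample))
print(iris_sample.shape)
```

Read this way, the first observation stays in the data, so the full file would yield one more row than when it is read with the default settings.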

# Exploring the content of the data set

Use the `shape` attribute to determine the number of rows and columns in a data frame. This can be used to confirm that we have actually obtained the data that we are expecting.

Based on what we see below, the data set being read here has $N_r$ rows, corresponding to $N_r$ observations, and $N_c$ columns, corresponding to $N_c$ variables in this particular data file.

In [73]:
#Print the number of rows
print("Number of rows:", df.shape[0])

#Print the number of columns
print("Number of columns:", df.shape[1])


Number of rows: 149
Number of columns: 5


If we want to show the entire data frame we would simply write the following:

In [74]:
#Print the entire dataframe
print(df)

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                  4.9               3.0                1.4               0.2   
1                  4.7               3.2                1.3               0.2   
2                  4.6               3.1                1.5               0.2   
3                  5.0               3.6                1.4               0.2   
4                  5.4               3.9                1.7               0.4   
..                 ...               ...                ...               ...   
144                6.7               3.0                5.2               2.3   
145                6.3               2.5                5.0               1.9   
146                6.5               3.0                5.2               2.0   
147                6.2               3.4                5.4               2.3   
148                5.9               3.0                5.1               1.8   

              Class  
0    

As you can see, we have a two-dimensional object where each row is an independent observation and each column is a variable.

Now, use the `head()` function to show the first 5 rows of our data frame.

In [75]:
#Showing the first five rows of the dataframe
df.head(5)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Class
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa


Also, you can use the `tail()` function to show the last 5 rows of our data frame.

In [76]:
#Showing the last 5 rows of our dataframe
df.tail(5)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Class
144,6.7,3.0,5.2,2.3,Iris-virginica
145,6.3,2.5,5.0,1.9,Iris-virginica
146,6.5,3.0,5.2,2.0,Iris-virginica
147,6.2,3.4,5.4,2.3,Iris-virginica
148,5.9,3.0,5.1,1.8,Iris-virginica


The columns in a Pandas data frame have names. To see them, use the `columns` attribute:

In [77]:
#Printing the names of the columns
print(df.columns)

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'Class'],
      dtype='object')


Be aware that every variable in a Pandas data frame has a data type.  There are many different data types, but most commonly you will encounter floating point values (real numbers), integers, strings (text), and date/time values.  When Pandas reads a text/csv file, it infers the data type of each column from the values it finds in that column.  Usually it selects an appropriate type, but occasionally it does not (for example, a single non-numeric entry causes a numeric column to be read as text).  To confirm that the data types are consistent with what the variables represent, inspect the `dtypes` attribute of the data frame.
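
To illustrate how an inferred type can differ from what a variable represents, here is a small sketch with made-up data (the `people` frame and its columns are hypothetical). The `height` column contains one non-numeric entry, so it is not read as a number until it is converted explicitly with `pd.to_numeric`.

```python
import io
import pandas as pd

# Hypothetical data: the 'height' column contains a non-numeric entry
csv_text = "id,height,city\n1,170,Leon\n2,unknown,Puebla\n3,182,Merida\n"
people = pd.read_csv(io.StringIO(csv_text))
print(people.dtypes)  # 'height' is not numeric here because of 'unknown'

# Convert to numbers; entries that cannot be parsed become NaN
people["height"] = pd.to_numeric(people["height"], errors="coerce")
print(people.dtypes)
```

After the conversion, the unparseable entry is stored as a missing value, which can then be handled like any other missing data.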

In [78]:
#Printing the datatypes of the dataframe
print(df.dtypes)

sepal length (cm)    float64
sepal width (cm)     float64
petal length (cm)    float64
petal width (cm)     float64
Class                 object
dtype: object


Summary statistics, which include things like the mean, min, and max of the data, can be useful to get a feel for how large some of the variables are and what variables may be the most important.

In [79]:
# Filter the numerical columns into a new data frame containing only those
numerical_df = df.select_dtypes(include=['number'])

# Summary statistics for the quantitative variables
numerical_df.describe()


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
count,149.0,149.0,149.0,149.0
mean,5.848322,3.051007,3.774497,1.205369
std,0.828594,0.433499,1.759651,0.761292
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.4,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5
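
The `25%`, `50%`, and `75%` rows of a `describe()` summary are quartiles. As a rough sketch on a small made-up series (the name `values` is illustrative), the following shows that they match what `quantile` returns and checks what share of the values lies at or below the first quartile.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

summary = values.describe()
first_quartile = values.quantile(0.25)

# Share of the values that are less than or equal to the first quartile
share_at_or_below = (values <= first_quartile).mean()

print(summary["25%"], summary["50%"], summary["75%"])
print(first_quartile, share_at_or_below)
```

With only nine values the share is one third rather than exactly one quarter; on larger datasets it gets closer to 25%.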


In [80]:
# Drop observations with NaN values
# There are no NaN values in this dataset, so there is nothing to drop
# (describe() does not take NaN values into account anyway)

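It is also possible to get statistics on the entire data frame or on a single column, as follows:

- `df.mean()` Returns the mean of each column (in recent Pandas versions, pass `numeric_only=True` if the data frame also contains text columns)
- `df.corr()` Returns the pairwise correlation between the numeric columns in a data frame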
- `df.count()` Returns the number of non-null values in each data frame column
- `df.max()` Returns the highest value in each column
- `df.min()` Returns the lowest value in each column
- `df.median()` Returns the median of each column
- `df.std()` Returns the standard deviation of each column
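
As a short sketch of these functions on a data frame that mixes numeric and text columns (the `plants` frame is made up for illustration), `numeric_only=True` restricts the computation to the numeric columns. This assumes a Pandas version in which `corr` accepts that argument.

```python
import pandas as pd

# Hypothetical mixed-type data frame
plants = pd.DataFrame({
    "height": [10.0, 12.0, 14.0],
    "width": [1.0, 2.0, 3.0],
    "kind": ["a", "b", "a"],
})

# numeric_only=True restricts the computation to the numeric columns
means = plants.mean(numeric_only=True)
correlations = plants.corr(numeric_only=True)
print(means)
print(correlations)
```

An equivalent approach, used elsewhere in this notebook, is to select the numeric columns first with `select_dtypes(include=['number'])`.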

In [81]:
#Some examples
print(numerical_df.mean()) #Printing the mean of all numerical columns
print("\n")
#The max petal width
max_petal_width = df['petal width (cm)'].max()
print("The maximum petal width is:", max_petal_width) 

#The min sepal length
min_sepal_length = df['sepal length (cm)'].min()
print("The minimum sepal length is:", min_sepal_length) 


sepal length (cm)    5.848322
sepal width (cm)     3.051007
petal length (cm)    3.774497
petal width (cm)     1.205369
dtype: float64


The maximum petal width is: 2.5
The minimum sepal length is: 4.3


# How to write a data frame to a File

To save your data to a file, simply use the `to_csv` method.

Examples:
- `df.to_csv('myDataFrame.csv')`
- `df.to_csv('myDataFrame.csv', sep='\t')`
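
By default, `to_csv` also writes the row labels (the index), which show up as an extra unnamed column when the file is read back. The following sketch, with a made-up `frame` written to an in-memory buffer instead of a file, uses `index=False` to leave them out.

```python
import io
import pandas as pd

frame = pd.DataFrame({"x": [1, 2], "y": [3.5, 4.5]})

# Write to an in-memory buffer; index=False leaves out the row labels
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
print(buffer.getvalue())

# Reading the text back gives the same columns, with no extra index column
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.equals(frame))
```

Whether to keep the index depends on the data: if the row labels carry meaning, writing them out may be preferable.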

In [82]:
#Saving the numerical dataframe
numerical_df.to_csv(absoluteRoute+'/irisDataSet.csv')


# Rename columns

To change the name of a column, use the `rename` method.

Example:

df = df.rename(columns={"Age": "Edad"})

df.head()

In [83]:
#Temporarily renaming the sepal length column
df = df.rename(columns={'sepal length (cm)':'Numerical ID'})
df.head(1)

Unnamed: 0,Numerical ID,sepal width (cm),petal length (cm),petal width (cm),Class
0,4.9,3.0,1.4,0.2,Iris-setosa


In [84]:
# Back to the original name
df = df.rename(columns={'Numerical ID':'sepal length (cm)'})
df.head(1)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Class
0,4.9,3.0,1.4,0.2,Iris-setosa


# Selection of columns

As discussed above, a Pandas data frame is a rectangular data table, in which the rows represent observations or samples and the columns represent variables.  One common manipulation of a data frame is to extract the data for one case or for one variable.  There are several ways to do this, as shown below.

To extract all the values for one column (variable), use one of the following alternatives.

In [85]:
#Selecting all values of a single column by its position (column 1 is the second column)
#Equivalent selection by name: df['sepal width (cm)']
a = df.iloc[:,1]

print(a)



0      3.0
1      3.2
2      3.1
3      3.6
4      3.9
      ... 
144    3.0
145    2.5
146    3.0
147    3.4
148    3.0
Name: sepal width (cm), Length: 149, dtype: float64


# Slicing a data set


Let's say we would like to slice our data frame and select only specific portions of our data.  There are three indexers that have been used for this:

1. `.loc[]`
2. `.iloc[]`
3. `.ix[]` (deprecated, and removed in Pandas 1.0)

We will cover the `.loc[]` and `.iloc[]` indexers.


The **.loc[]** indexer uses row labels and column names. Specifically, it takes two selectors (each a single label, a list, or a range) separated by `,`: the first one indicates the rows and the second one indicates the columns. Note that, unlike regular Python slices, a label range such as `4:9` includes both endpoints.

In [86]:
# Return all observations of sepal width (cm)
#df.loc[:,"sepal width (cm)"]

# Return a subset of observations of sepal width (cm)
#df.loc[:9, "sepal width (cm)"]

# Select all rows for multiple columns, ["sepal width (cm)", "sepal length (cm)"]
#df.loc[:,["sepal width (cm)", "sepal length (cm)"]]

# Select multiple columns, ["sepal width (cm)", "sepal length (cm)"]
keep = ['sepal width (cm)', 'sepal length (cm)']
df_sepal = df[keep]

# Select a few rows for multiple columns, ["sepal width (cm)", "sepal length (cm)"]
# Range-based slicing for certain columns only (rows labeled 4 through 9, both included):
df.loc[4:9, ["sepal width (cm)", "sepal length (cm)"]]

# Select range of rows for all columns
#df.loc[10:15,:]



Unnamed: 0,sepal width (cm),sepal length (cm)
4,3.9,5.4
5,3.4,4.6
6,3.4,5.0
7,2.9,4.4
8,3.1,4.9
9,3.7,5.4


The **.iloc[]** indexer performs integer (position-based) slicing.

In [87]:
# The first integer represents the rows and the second one represents the columns
df.iloc[:, :4] #This returns all rows of the first four columns

# This example returns the first four rows of all the columns
df.iloc[:4, :]

# This example returns all the rows for the columns at positions 3 to 6
# (only positions 3 and 4 exist here; out-of-range positions in a slice are ignored)
df.iloc[:, 3:7]

# This example returns the rows at positions 4 to 7 (the stop position 8 is excluded)
# for the columns at positions 2 and 3
df.iloc[4:8, 2:4]

# This is incorrect, because iloc accepts integer positions, not column names:
#df.iloc[1:5, ["sepal width (cm)", "sepal length (cm)"]]

Unnamed: 0,petal length (cm),petal width (cm)
4,1.7,0.4
5,1.4,0.3
6,1.5,0.2
7,1.4,0.2
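
The difference between label-based and position-based slicing is easier to see when the row labels do not coincide with the row positions. The following is a small sketch with a made-up `letters` frame whose index labels are 10, 20, and so on.

```python
import pandas as pd

letters = pd.DataFrame(
    {"upper": list("ABCDE"), "lower": list("abcde")},
    index=[10, 20, 30, 40, 50],  # labels deliberately differ from positions
)

# .loc slices by label and includes both endpoints
by_label = letters.loc[20:40, "upper"]

# .iloc slices by position and excludes the stop position
by_position = letters.iloc[1:3, 0]

print(list(by_label))
print(list(by_position))
```

Here the label slice `20:40` returns three rows, while the position slice `1:3` returns two.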


# Get unique existing values

List the unique values in one of the columns:

df.Gender.unique()


In [88]:
# List unique values in the df['Class'] column
df['Class'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

# Filter, Sort and Groupby



With **Filter** you can use different conditions to filter rows. For example, `df[df["year"] > 1984]` would give you only the rows in which the column `year` is greater than 1984. You can use `&` (and) or `|` (or) to combine different conditions in your filtering. This is also called boolean filtering.

df[df["Height"] >= 70]
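
As a sketch of combining conditions with `&` and `|`, consider the following made-up `flowers` frame. Each condition is wrapped in parentheses, since `&` and `|` bind more tightly than the comparison operators.

```python
import pandas as pd

flowers = pd.DataFrame({
    "width": [3.0, 3.6, 2.5, 3.9],
    "length": [4.9, 5.0, 6.3, 5.4],
})

# Rows that satisfy both conditions
both = flowers[(flowers["width"] >= 3.0) & (flowers["length"] > 4.9)]

# Rows that satisfy at least one of the conditions
either = flowers[(flowers["width"] < 3.0) | (flowers["length"] > 5.2)]

print(both)
print(either)
```

The filtered frames keep the original row labels, which is why the index of the result may have gaps.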

In [89]:
#Filtering the data frame by sepal width (cm) and showing the first 10 matching rows
print(df[df["sepal width (cm)"] >= 3.0].iloc[:10, :])

    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                 4.9               3.0                1.4               0.2   
1                 4.7               3.2                1.3               0.2   
2                 4.6               3.1                1.5               0.2   
3                 5.0               3.6                1.4               0.2   
4                 5.4               3.9                1.7               0.4   
5                 4.6               3.4                1.4               0.3   
6                 5.0               3.4                1.5               0.2   
8                 4.9               3.1                1.5               0.1   
9                 5.4               3.7                1.5               0.2   
10                4.8               3.4                1.6               0.2   

          Class  
0   Iris-setosa  
1   Iris-setosa  
2   Iris-setosa  
3   Iris-setosa  
4   Iris-setosa  
5   Iris-se

With **Sort** it is possible to sort the values in a certain column in ascending order using `df.sort_values("ColumnName")` or in descending order using `df.sort_values("ColumnName", ascending=False)`.

Furthermore, it’s possible to sort values by Column1Name in ascending order and then by Column2Name in descending order by using `df.sort_values(["Column1Name", "Column2Name"], ascending=[True, False])`.


df.sort_values("Height")
#df.sort_values("Height",ascending=False)
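
A minimal sketch of sorting by two columns, using a made-up frame with hypothetical `group` and `score` columns:

```python
import pandas as pd

scores = pd.DataFrame({
    "group": ["b", "a", "b", "a"],
    "score": [1, 5, 3, 2],
})

# Sort by group in ascending order, then by score in descending order within each group
ordered = scores.sort_values(["group", "score"], ascending=[True, False])
print(ordered)
```

`sort_values` returns a new data frame and leaves the original one unchanged, so the result has to be assigned to a variable to be kept.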

In [90]:
#Sorting the data frame by sepal width in descending order
#Printing the 10 flowers that have the greatest sepal width

print(df.sort_values("sepal width (cm)", ascending=False).iloc[0:10, 0:2])

     sepal length (cm)  sepal width (cm)
14                 5.7               4.4
32                 5.5               4.2
31                 5.2               4.1
13                 5.8               4.0
4                  5.4               3.9
15                 5.4               3.9
130                7.9               3.8
43                 5.1               3.8
17                 5.7               3.8
18                 5.1               3.8


The **Groupby** operation involves splitting the data into groups based on some criteria, applying a function to each group independently, and combining the results into a data structure. `df.groupby(col)` returns a groupby object for values from one column, while `df.groupby([col1, col2])` returns a groupby object for values from multiple columns.

df.groupby(['Gender'])

In [91]:
#Grouping by class
df_class = df.groupby(['Class'])

#How many flowers there are in each class:
print(df_class.size())


Class
Iris-setosa        49
Iris-versicolor    50
Iris-virginica     50
dtype: int64


Size of each group

df.groupby(['Gender']).size()

df.groupby(['Gender','GenderGroup']).size()
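
The split, apply, and combine steps can be sketched on a small made-up frame (the `measurements` frame and its columns are hypothetical): the data is split by `species`, a function is applied to each group, and the results are combined into one Series per statistic.

```python
import pandas as pd

measurements = pd.DataFrame({
    "species": ["x", "x", "y", "y", "y"],
    "length": [1.0, 3.0, 2.0, 4.0, 6.0],
})

grouped = measurements.groupby("species")

# Number of rows in each group
group_sizes = grouped.size()

# Mean of the 'length' column within each group
group_means = grouped["length"].mean()

print(group_sizes)
print(group_means)
```

Other functions such as `max()`, `min()`, or `median()` can be applied to the groups in the same way.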

# Data Cleaning: handling missing data

Before getting started to work with your data, it's a good practice to observe it thoroughly to identify missing values and handle them accordingly.

When reading a dataset using Pandas, there is a set of values including 'NA', 'NULL', and 'NaN' that are taken by default to represent a missing value.  The full list of default missing value codes is in the '`read_csv`' documentation [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).  This document also explains how to change the way that '`read_csv`' decides whether a variable's value is missing.

Pandas has functions called `isnull` and `notnull` that can be used to identify where the missing and non-missing values are located in a data frame.  

Below we use these functions to count the number of missing and non-missing values in each variable of the dataset.

In [92]:
newDf = pd.read_csv(routeOfIris)

#Is there any null data?
newDf.isnull().sum()

#How much of the data is not null?
newDf.notnull().sum()

5.1            149
3.5            149
1.4            149
0.2            149
Iris-setosa    149
dtype: int64

Now we use these functions to count the number of missing and non-missing values in a single variable in the dataset

print( df.Height.notnull().sum() )

print( pd.isnull(df.Height).sum() )
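
Since the iris data has no missing values, here is a sketch on a made-up `readings` frame that does contain some. It counts the missing values per column and keeps only the complete rows with `dropna`.

```python
import numpy as np
import pandas as pd

readings = pd.DataFrame({
    "temp": [21.5, np.nan, 19.0],
    "site": ["north", "south", None],
})

# Number of missing values in each column
missing_per_column = readings.isnull().sum()

# Keep only the rows that have no missing values
complete_rows = readings.dropna()

print(missing_per_column)
print(complete_rows)
```

Dropping rows is only one option; depending on the analysis, it may be preferable to keep them and fill in the gaps, for example with `fillna`.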

# Add and eliminate columns

In some cases it is useful to create or eliminate columns.

In [93]:
# # Eliminate inserted column
# df.drop("ColumnInserted", axis=1, inplace = True)
# # Remove three columns as index base
# #df.drop(df.columns[[12]], axis = 1, inplace = True)

print(df_class.head())
#Specifying the axis from which we want to remove the data (axis=1 means columns).
df_no_class = df.drop("Class", axis = 1)
print(df_no_class.head())

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                  4.9               3.0                1.4               0.2   
1                  4.7               3.2                1.3               0.2   
2                  4.6               3.1                1.5               0.2   
3                  5.0               3.6                1.4               0.2   
4                  5.4               3.9                1.7               0.4   
49                 7.0               3.2                4.7               1.4   
50                 6.4               3.2                4.5               1.5   
51                 6.9               3.1                4.9               1.5   
52                 5.5               2.3                4.0               1.3   
53                 6.5               2.8                4.6               1.5   
99                 6.3               3.3                6.0               2.5   
100                5.8      

In [94]:
# # Add new column derived from existing columns
df['petal_area'] = df['petal length (cm)'] * df['petal width (cm)']

df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Class,petal_area
0,4.9,3.0,1.4,0.2,Iris-setosa,0.28
1,4.7,3.2,1.3,0.2,Iris-setosa,0.26
2,4.6,3.1,1.5,0.2,Iris-setosa,0.3
3,5.0,3.6,1.4,0.2,Iris-setosa,0.28
4,5.4,3.9,1.7,0.4,Iris-setosa,0.68


In [95]:
# # Eliminate inserted column
df.drop("petal_area", axis=1, inplace = True)
#
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Class
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa


In [96]:
## Add a new column with strata based on these cut points
#
## Create and insert a column petal area
df['petal_area'] = df['petal length (cm)'] * df['petal width (cm)']

# Assign each petal area to an interval; areas outside (0, 0.7] become NaN
df["PetalStrata"] = pd.cut(df["petal_area"], [0., .3, .4, .5, .6, .7])
#
## Show the resulting data frame
df




Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Class,petal_area,PetalStrata
0,4.9,3.0,1.4,0.2,Iris-setosa,0.28,"(0.0, 0.3]"
1,4.7,3.2,1.3,0.2,Iris-setosa,0.26,"(0.0, 0.3]"
2,4.6,3.1,1.5,0.2,Iris-setosa,0.30,"(0.3, 0.4]"
3,5.0,3.6,1.4,0.2,Iris-setosa,0.28,"(0.0, 0.3]"
4,5.4,3.9,1.7,0.4,Iris-setosa,0.68,"(0.6, 0.7]"
...,...,...,...,...,...,...,...
144,6.7,3.0,5.2,2.3,Iris-virginica,11.96,
145,6.3,2.5,5.0,1.9,Iris-virginica,9.50,
146,6.5,3.0,5.2,2.0,Iris-virginica,10.40,
147,6.2,3.4,5.4,2.3,Iris-virginica,12.42,
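
The way `pd.cut` assigns values to intervals can be sketched on a few made-up areas. The bin edges and the labels `small` and `medium` are illustrative choices, not part of the original analysis.

```python
import pandas as pd

areas = pd.Series([0.28, 0.45, 0.68, 11.96])

# Intervals are closed on the right by default: (0.0, 0.3] and (0.3, 0.7]
strata = pd.cut(areas, bins=[0.0, 0.3, 0.7], labels=["small", "medium"])
print(strata)
```

Values that fall outside all the intervals, such as the last one here, are marked as missing, which is why the large petal areas above have no stratum.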


In [97]:
## Eliminate inserted column
df.drop("PetalStrata", axis=1, inplace = True)
df.drop("petal_area", axis=1, inplace = True)
#
df.head()




Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Class
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa


# Add and eliminate rows

In some cases it is required to add new observations (rows) to the data set.

In [98]:
# Print tail
df.tail(5)


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Class
144,6.7,3.0,5.2,2.3,Iris-virginica
145,6.3,2.5,5.0,1.9,Iris-virginica
146,6.5,3.0,5.2,2.0,Iris-virginica
147,6.2,3.4,5.4,2.3,Iris-virginica
148,5.9,3.0,5.1,1.8,Iris-virginica


In [99]:

#This appends a new row to the data frame
#Each of the values in the vector represents a column in the dataframe.
df.loc[len(df.index)] = [5.1, 3.5, 1.4, 0.2, 'Iris-setosa']
#
df.tail()




Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Class
145,6.3,2.5,5.0,1.9,Iris-virginica
146,6.5,3.0,5.2,2.0,Iris-virginica
147,6.2,3.4,5.4,2.3,Iris-virginica
148,5.9,3.0,5.1,1.8,Iris-virginica
149,5.1,3.5,1.4,0.2,Iris-setosa
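
Another way to add rows, sketched here on a made-up `base` frame, is `pd.concat`, which combines two data frames with the same columns. This is an alternative to the `.loc` assignment above, not a replacement for it.

```python
import pandas as pd

base = pd.DataFrame({"length": [5.9, 6.2], "kind": ["virginica", "virginica"]})
new_row = pd.DataFrame({"length": [5.1], "kind": ["setosa"]})

# ignore_index=True renumbers the rows 0..n-1 in the combined frame
extended = pd.concat([base, new_row], ignore_index=True)

# Dropping the last row label undoes the addition
trimmed = extended.drop(extended.index[-1])

print(extended)
print(trimmed)
```

`pd.concat` tends to be more convenient when several rows are added at once, since they can all be placed in one data frame first.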


In [100]:
## Eliminate inserted row
df.drop([len(df.index)-1], inplace = True )
#
df.tail()




Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Class
144,6.7,3.0,5.2,2.3,Iris-virginica
145,6.3,2.5,5.0,1.9,Iris-virginica
146,6.5,3.0,5.2,2.0,Iris-virginica
147,6.2,3.4,5.4,2.3,Iris-virginica
148,5.9,3.0,5.1,1.8,Iris-virginica


# Cleaning your data: drop out unused columns and/or drop out rows with any missing values

In [101]:
# Drop unused columns
#vars = ["ID", "GenderGroup", "GlassesGroup", "CompleteGroup"]
#df.drop(vars, axis=1, inplace = True)

#vars = ["Age", "Gender", "Glasses", "Height", "Wingspan", "CWDistance", "Complete", "Score"]
#df = df[vars]

# Drop rows with any missing values
#df = df.dropna()

# Drop unused columns and drop rows with any missing values
#vars = ["Age", "Gender", "Glasses", "Height", "Wingspan", "CWDistance", "Complete", "Score"]
#df = df[vars].dropna()

#df


# Final remarks


- The understanding of your dataset is essential
    - Number of observations
    - Variables
    - Data types: numerical or categorical
    - What are my variables of interest

- There are several ways to do the same thing

- Cleaning your dataset (dropping out rows with any missing values) is a good practice

- The **Pandas** library provides fancy, high-performance, easy-to-use data structures and data analysis tools


# Activity: work with the iris dataset

Repeat this tutorial with the iris data set and respond to the following inquiries

1. Calculate the statistical summary for each quantitative variable. Explain the results
    - Identify the name of each column
    - Identify the type of each column
    - Minimum, maximum, mean, median, standard deviation
    
    
2. Are there missing data? If so, create a new dataset containing only the rows with the non-missing data


3. Create a new dataset containing only the petal width and length and the type of Flower


4. Create a new dataset containing only the sepal width and length and the type of Flower


5. Create a new dataset containing the sepal width and length and the type of Flower encoded as a categorical numerical column




In [102]:
#Calculate the statistical summary for each quantitative variable. Explain the results
#Selecting only the numerical values
numericalIrisDf = df.select_dtypes(include='number')

#Obtaining its summary statistics
print(numericalIrisDf.describe())

       sepal length (cm)  sepal width (cm)  petal length (cm)  \
count         149.000000        149.000000         149.000000   
mean            5.848322          3.051007           3.774497   
std             0.828594          0.433499           1.759651   
min             4.300000          2.000000           1.000000   
25%             5.100000          2.800000           1.600000   
50%             5.800000          3.000000           4.400000   
75%             6.400000          3.300000           5.100000   
max             7.900000          4.400000           6.900000   

       petal width (cm)  
count        149.000000  
mean           1.205369  
std            0.761292  
min            0.100000  
25%            0.300000  
50%            1.300000  
75%            1.800000  
max            2.500000  


# Explaining the results

- "count" is the number of non-missing values in each column that `describe` takes into account.
- "mean" is the sum of the values in that column divided by "count".
- "std" is the standard deviation of that column, a measure of how much the values typically deviate from the mean. A higher standard deviation indicates that the data points are more spread out from the mean.
- "min" is the minimum value of that column.
- "25%" is the first quartile: about 25% of the values in that column are less than or equal to it. For example, for petal width the value of "25%" is 0.3, which means that about 25% of the flowers have a petal width of 0.3 or less.
- "50%", also referred to as the median, is the middle value of that column: about half of the values are less than or equal to it.
- "75%" is the third quartile: about 75% of the values in that column are less than or equal to it. Taking petal width as an example, its "75%" value is 1.8, meaning that about 75% of the flowers have a petal width of 1.8 or less.
- "max" is the maximum value of that column.

In [103]:
# Identify the name of each column
print(df.columns)
# Identify the type of each column
print(df.dtypes)


Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'Class'],
      dtype='object')
sepal length (cm)    float64
sepal width (cm)     float64
petal length (cm)    float64
petal width (cm)     float64
Class                 object
dtype: object



3. Create a new dataset containing only the petal width and length and the type of Flower


In [104]:
excludeColumns = ["sepal length (cm)", "sepal width (cm)"]
newDataFrame = df.drop(excludeColumns, axis=1)
print(newDataFrame)

     petal length (cm)  petal width (cm)           Class
0                  1.4               0.2     Iris-setosa
1                  1.3               0.2     Iris-setosa
2                  1.5               0.2     Iris-setosa
3                  1.4               0.2     Iris-setosa
4                  1.7               0.4     Iris-setosa
..                 ...               ...             ...
144                5.2               2.3  Iris-virginica
145                5.0               1.9  Iris-virginica
146                5.2               2.0  Iris-virginica
147                5.4               2.3  Iris-virginica
148                5.1               1.8  Iris-virginica

[149 rows x 3 columns]


4. Create a new dataset containing only the sepal width and length and the type of Flower

In [105]:
excludeColumns = ["petal length (cm)", "petal width (cm)"]
newDataFrame1 = df.drop(excludeColumns, axis=1)
print(newDataFrame1)

     sepal length (cm)  sepal width (cm)           Class
0                  4.9               3.0     Iris-setosa
1                  4.7               3.2     Iris-setosa
2                  4.6               3.1     Iris-setosa
3                  5.0               3.6     Iris-setosa
4                  5.4               3.9     Iris-setosa
..                 ...               ...             ...
144                6.7               3.0  Iris-virginica
145                6.3               2.5  Iris-virginica
146                6.5               3.0  Iris-virginica
147                6.2               3.4  Iris-virginica
148                5.9               3.0  Iris-virginica

[149 rows x 3 columns]


5. Create a new dataset containing the sepal width and length and the type of Flower encoded as a categorical numerical column


In [106]:
excludeColumns = ["petal length (cm)", "petal width (cm)"]
newDataFrame2 = df.drop(excludeColumns, axis=1)
newDataFrame2['species_encoded'] = newDataFrame2['Class'].astype('category').cat.codes
newDataFrame2.drop(['Class'], axis=1, inplace=True)
print(newDataFrame2.iloc[48,:])
print(newDataFrame2.iloc[70,:])
print(newDataFrame2.iloc[99,:])



sepal length (cm)    5.0
sepal width (cm)     3.3
species_encoded      0.0
Name: 48, dtype: float64
sepal length (cm)    6.1
sepal width (cm)     2.8
species_encoded      1.0
Name: 70, dtype: float64
sepal length (cm)    6.3
sepal width (cm)     3.3
species_encoded      2.0
Name: 99, dtype: float64
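
To see which class each numeric code stands for, the categories can be inspected alongside the codes. The following sketch uses a short made-up series of class names (the variable names are illustrative).

```python
import pandas as pd

classes = pd.Series(["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"])
as_category = classes.astype("category")

# Codes follow the sorted order of the category labels
codes = as_category.cat.codes

# Mapping from each numeric code back to its class name
code_to_label = dict(enumerate(as_category.cat.categories))

print(list(codes))
print(code_to_label)
```

This matches the output above, where the three classes are encoded as 0, 1, and 2 in alphabetical order.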
