# Background
The Iris dataset contains four features (the length and width of the sepals and petals) for 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor), 150 samples in total. These measurements were used to build a linear discriminant model that classifies the species. The dataset is often used in data mining, classification and clustering examples, and to test algorithms.

## Data Description

Petal Length - in cm

Petal Width - in cm

Sepal Length - in cm

Sepal Width - in cm

Species - Setosa, Versicolor, and Virginica

### Getting Started with Pandas:

In [1]:
import pandas as pd 

### Load the dataset
The Iris flower data set, also known as Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems". It is one of the most widely used datasets among people learning machine learning.

The dataset is available in the [UCI Machine Learning repository](https://archive.ics.uci.edu/ml/datasets/iris). We can download the data directly from the internet using its URL.

In [2]:
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header = None)

### Check if the dataset has been downloaded correctly

In [4]:
# Looking at the first few rows
data.head()

|  | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


The imported dataset does not have meaningful column headers. Because we passed `header = None`, pandas did not treat the first row as a header and labeled the columns 0 to 4 instead. Let us add column headers to this data frame.

### Add column names/headers in the dataframe

In [5]:
data.columns = ['Petal Length','Petal Width','Sepal Length','Sepal Width','Species']
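
As an alternative to assigning `data.columns` afterwards, the labels can be supplied while the file is read. The sketch below is only an illustration: it assumes a tiny in-memory CSV (`raw`, with made-up rows) in place of the UCI URL so that it runs offline, and it reuses the column labels from the cell above. `header` and `names` are standard `pd.read_csv` parameters.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the headerless iris.data file (made-up values)
raw = "1.0,2.0,3.0,4.0,Iris-setosa\n5.0,6.0,7.0,8.0,Iris-virginica\n"

columns = ['Petal Length', 'Petal Width', 'Sepal Length', 'Sepal Width', 'Species']

# header=None: the first line is treated as data, so pandas labels the columns 0..4
unlabeled = pd.read_csv(io.StringIO(raw), header=None)

# names=...: supply the labels at read time instead of assigning them later
labeled = pd.read_csv(io.StringIO(raw), header=None, names=columns)

print(list(unlabeled.columns))
print(list(labeled.columns))
```

Both calls keep every line of the file as data; only the column labels differ.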

In [6]:
# Looking at the first few rows again
data.head()

|  | Petal Length | Petal Width | Sepal Length | Sepal Width | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [7]:
# Looking at the last few rows of the data frame
data.tail()

|  | Petal Length | Petal Width | Sepal Length | Sepal Width | Species |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
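
Both `head()` and `tail()` return five rows by default and accept an optional row count. A minimal sketch on a hypothetical ten-row frame (`toy` is not the Iris data):

```python
import pandas as pd

# Hypothetical frame with ten rows, index 0..9
toy = pd.DataFrame({"value": range(10)})

first_three = toy.head(3)   # rows with index 0, 1, 2
last_two = toy.tail(2)      # rows with index 8, 9

print(first_three.index.tolist())
print(last_two.index.tolist())
```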


We can save this dataset to our local system as a csv file.

### Export dataframe as csv

In [8]:
data.to_csv('iris.csv', index=False) # Saves the file in the current working directory, usually the folder that contains the notebook
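
The `index=False` argument matters: without it, pandas also writes the row index as an extra, unnamed first column. The sketch below shows the difference on a hypothetical two-row frame with made-up values. It relies on `to_csv` returning the CSV text as a string when no path is given, so nothing is written to disk.

```python
import io
import pandas as pd

# Hypothetical two-row frame with made-up values
toy = pd.DataFrame({"Sepal Length": [1.0, 2.0],
                    "Species": ["Iris-setosa", "Iris-virginica"]})

with_index = toy.to_csv()                # index written as an unnamed first column
without_index = toy.to_csv(index=False)  # only the actual columns are written

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the index-free text back yields the same two columns
round_trip = pd.read_csv(io.StringIO(without_index))
print(list(round_trip.columns))
```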

Let us now look at the data itself.

### Displaying a random sample of rows

In [9]:
data.sample(10) 

|  | Petal Length | Petal Width | Sepal Length | Sepal Width | Species |
|---|---|---|---|---|---|
| 48 | 5.3 | 3.7 | 1.5 | 0.2 | Iris-setosa |
| 141 | 6.9 | 3.1 | 5.1 | 2.3 | Iris-virginica |
| 132 | 6.4 | 2.8 | 5.6 | 2.2 | Iris-virginica |
| 76 | 6.8 | 2.8 | 4.8 | 1.4 | Iris-versicolor |
| 123 | 6.3 | 2.7 | 4.9 | 1.8 | Iris-virginica |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 49 | 5.0 | 3.3 | 1.4 | 0.2 | Iris-setosa |
| 120 | 6.9 | 3.2 | 5.7 | 2.3 | Iris-virginica |
| 137 | 6.4 | 3.1 | 5.5 | 1.8 | Iris-virginica |
| 70 | 5.9 | 3.2 | 4.8 | 1.8 | Iris-versicolor |
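
Because `sample()` draws rows at random, each run normally returns a different selection. Passing `random_state` makes the draw repeatable, which can help when sharing a notebook. A minimal sketch on a hypothetical twenty-row frame; the seed value 42 is arbitrary:

```python
import pandas as pd

# Hypothetical frame with twenty rows
toy = pd.DataFrame({"value": range(20)})

# The same seed gives the same rows on every call
first = toy.sample(5, random_state=42)
second = toy.sample(5, random_state=42)

print(first.index.tolist())
print(first.equals(second))
```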


### Check out the shape of the dataset

In [10]:
data.shape

(150, 5)

The dataset has 150 rows of observations and 5 columns.
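
Since `shape` is a plain `(rows, columns)` tuple, it can be unpacked directly. A short sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = toy.shape  # unpack the (rows, columns) tuple
print(n_rows, n_cols)

# len() on a data frame gives the number of rows
print(len(toy))
```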

### Slicing the rows
To print or work with a particular group of rows, for example the rows with index 10 through 20, slice the data frame.

In [11]:
# data[start:end] 
# start is inclusive whereas end is exclusive 
# the slice below selects the rows with index 10 to 20 and saves them
# in a variable for further use in analysis 
sliced_data = data[10:21] 
print(sliced_data) 

    Petal Length  Petal Width  Sepal Length  Sepal Width      Species
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7          3.8           1.7          0.3  Iris-setosa
19           5.1          3.8           1.5          0.3  Iris-setosa
20           5.4          3.4           1.7          0.2  Iris-setosa
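
Plain `data[start:end]` slicing works by position. pandas also offers explicit indexers, and they treat the end of the range differently: `iloc` is position-based and excludes the end, while `loc` is label-based and includes it. A sketch on a hypothetical thirty-row frame with the default 0..29 index:

```python
import pandas as pd

# Hypothetical frame with thirty rows and the default integer index
toy = pd.DataFrame({"value": [i * 10 for i in range(30)]})

by_position = toy.iloc[10:21]  # positions 10..20, end excluded
by_label = toy.loc[10:20]      # labels 10..20, end included

print(len(by_position))
print(by_position.equals(by_label))
```

With the default index the two selections coincide; they differ once the index labels no longer match the row positions.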

### Displaying only specific columns

In [16]:
# Select the columns Petal Width and Species from the iris data
# and save them in another variable named "specific_data" 
  
specific_data = data[["Petal Width","Species"]] 
# general form: data[["column_name1","column_name2","column_name3"]] 
  
# now we will print the first 10 rows of the specific_data dataframe. 
print(specific_data.head(10)) 

   Petal Width      Species
0          3.5  Iris-setosa
1          3.0  Iris-setosa
2          3.2  Iris-setosa
3          3.1  Iris-setosa
4          3.6  Iris-setosa
5          3.9  Iris-setosa
6          3.4  Iris-setosa
7          3.4  Iris-setosa
8          2.9  Iris-setosa
9          3.1  Iris-setosa
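
The double brackets are significant: a list of column names returns a data frame, whereas a single name in single brackets returns a Series. A short sketch on a hypothetical two-row frame with made-up values:

```python
import pandas as pd

# Hypothetical two-row frame with made-up values
toy = pd.DataFrame({"Petal Width": [1.0, 2.0],
                    "Species": ["Iris-setosa", "Iris-virginica"]})

as_series = toy["Petal Width"]   # single brackets: one column as a Series
as_frame = toy[["Petal Width"]]  # double brackets: a one-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```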


### Calculating sum, mean, median and mode of a particular column

In [12]:
# data["column_name"].sum() 
  
sum_data = data["Sepal Length"].sum() 
mean_data = data["Sepal Length"].mean() 
median_data = data["Sepal Length"].median() 
mode_data = data["Sepal Length"].mode() 
  
print("Sum:",sum_data, "\nMean:", mean_data, "\nMedian:",median_data, "\nMode:",mode_data) 

Sum: 563.8 
Mean: 3.7586666666666693 
Median: 4.35 
Mode: 0    1.5
dtype: float64
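
The mode is printed with an index and a `dtype` line because `mode()` returns a Series rather than a single number: a column can have several equally frequent values. The sketch below uses a hypothetical Series with made-up values that has two modes, and also shows `agg`, which computes several statistics in one call.

```python
import pandas as pd

# Hypothetical values in which 2.0 and 3.0 both occur twice
values = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0], name="Sepal Length")

modes = values.mode()        # a Series holding every mode, in sorted order
first_mode = modes.iloc[0]   # pick one value when a single number is needed
print(modes.tolist())
print(first_mode)

# Several statistics at once, returned as a Series indexed by statistic name
summary = values.agg(["sum", "mean", "median"])
print(summary)
```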


### Calculating sum, mean, median and mode of a particular Species

In [13]:
# Species == 'Iris-setosa'
# select the Sepal Length values of the setosa rows once, then summarise them

setosa_sepal_length = data.loc[data["Species"] == 'Iris-setosa', "Sepal Length"] 
sum_data_setosa = setosa_sepal_length.sum() 
mean_data_setosa = setosa_sepal_length.mean() 
median_data_setosa = setosa_sepal_length.median() 
mode_data_setosa = setosa_sepal_length.mode() 
  
print("Sum:",sum_data_setosa, "\nMean:", mean_data_setosa, "\nMedian:",median_data_setosa, "\nMode:",mode_data_setosa) 

Sum: 73.2 
Mean: 1.464 
Median: 1.5 
Mode: 0    1.5
dtype: float64



The groupby function is very helpful when we want to compute this kind of summary for every species at once.
Please try it on this dataset to practice.
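
As a starting point for that practice, here is a small preview on a hypothetical four-row frame with made-up values (not the Iris measurements). It groups the rows by species and computes the sum, mean and median of one column per group.

```python
import pandas as pd

# Hypothetical frame: two made-up rows per species
toy = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
    "Sepal Length": [1.0, 3.0, 5.0, 7.0],
})

# One row per species, one column per statistic
per_species = toy.groupby("Species")["Sepal Length"].agg(["sum", "mean", "median"])
print(per_species)
```

The same pattern should carry over to the full data frame by replacing `toy` with `data`.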

We will discuss groupby and several other data manipulation functions in the next session.