In [None]:
!wget -O iris.csv https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data > /dev/null 2>&1

# Load the data
Let's break down the command:

!: The exclamation mark at the beginning tells environments like Jupyter notebooks to run the rest of the line as a shell command. Omit it if you are running the command directly in a Unix-like shell.

wget: This is a free utility for non-interactive download of files from the web. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies.

-O iris.csv: The -O option in wget command is used to specify the name of the file that the content is downloaded to. In this case, the content being downloaded will be saved as iris.csv.

https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data: This is the URL of the file being downloaded: the Iris flower dataset from the UCI Machine Learning Repository.

\> /dev/null 2>&1: This is redirecting the output of the command. > /dev/null redirects the standard output (stdout) to /dev/null, which discards it. 2>&1 then redirects the standard error (stderr, denoted by the number 2) to stdout (denoted by the number 1), effectively discarding both standard output and standard error by sending them to /dev/null. This is commonly used to suppress output from a command.

So, put all together, this command downloads the file at the given URL and saves it as iris.csv, suppressing any output from the command.
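
If `wget` is not available, the same download can be sketched in Python using only the standard library. The helper name `download` is made up for this illustration and is not part of the notebook; like `wget -O`, it overwrites the destination file and prints nothing.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url: str, dest: str) -> Path:
    """Fetch `url` and write the bytes to `dest`, overwriting any existing file."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # Stream the response body to disk without loading it all into memory
        shutil.copyfileobj(response, out)
    return Path(dest)


# Usage (needs network access):
# download(
#     "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
#     "iris.csv",
# )
```

Unlike the redirected `wget` call, this version raises an exception on failure instead of failing silently, which is usually easier to debug in a notebook.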

In [None]:
import pandas as pd

df = pd.read_csv('iris.csv')
df

This cell uses the pandas library to load a CSV file into a DataFrame. Here's a breakdown of each line:

1. `import pandas as pd`: This imports the pandas library and gives it the alias `pd`. Pandas is a widely used data manipulation library in Python. It's particularly good at handling tabular data with heterogeneously-typed columns.

2. `df = pd.read_csv('iris.csv')`: The `read_csv` function reads a CSV (comma-separated values) file and converts it into a DataFrame. By default it treats the first line of the file as the column header. Here the file 'iris.csv' is loaded and the resulting DataFrame is stored in the variable `df`.

3. `df`: This line just evaluates the variable `df`, which is your DataFrame. In an interactive Python interpreter or a Jupyter notebook, evaluating a bare expression displays its value, so the DataFrame is shown. Depending on the environment, it might show the whole DataFrame or just the first few and last few rows if the DataFrame is large.

In [None]:
# Display the first 5 lines
df.head()

In [None]:
# Randomly sample the dataframe
df.sample(10)

In [None]:
# Some simple statistics to describe the dataframe
df.describe()

In [None]:
# What are the column names
df.columns

Umm... that doesn't look right. The columns are meant to be:

   1. sepal length in cm
   2. sepal width in cm
   3. petal length in cm
   4. petal width in cm
   5. class: 
      - Iris Setosa
      - Iris Versicolour
      - Iris Virginica

Let's reload the data with the column names.


In [None]:
columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Species']
df = pd.read_csv('iris.csv', names=columns)
df
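
To see why the first load went wrong, here is a minimal sketch on a few hand-typed rows in the same headerless layout as `iris.data` (the string `raw` and the names `inferred` and `named` are illustrative only). Without `names=`, `read_csv` uses the first data row as the header, so that row is lost from the data.

```python
import io

import pandas as pd

# A few rows in the same headerless layout as iris.data (typed in for illustration)
raw = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)

columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Species']

# Default behaviour: the first row is consumed as the header
inferred = pd.read_csv(io.StringIO(raw))

# With names=, every row is kept as data and the columns get proper labels
named = pd.read_csv(io.StringIO(raw), names=columns)

print(len(inferred), list(inferred.columns))
print(len(named), list(named.columns))
```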

In [None]:
# Let's see how many examples we have of each species
# (the output also shows the data type of the counts)
df["Species"].value_counts()

In [None]:
# Any missing values?
df.isnull()
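
`df.isnull()` returns a True/False value for every cell, which is hard to scan by eye. A common follow-up, sketched here on a small made-up frame (`toy` is not part of the Iris data), is to sum the flags per column:

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in each column
toy = pd.DataFrame({
    "Sepal Length": [5.1, np.nan, 7.0],
    "Species": ["Iris-setosa", "Iris-setosa", None],
})

# True counts as 1 when summed, so this gives the number of missing cells per column
missing_per_column = toy.isnull().sum()

# Collapse to a single yes/no answer for the whole frame
any_missing = toy.isnull().any().any()

print(missing_per_column)
print(any_missing)
```

On the notebook's own data the equivalent call would be `df.isnull().sum()`.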

The `describe()` method returns a statistical summary of the data in the DataFrame.

For numerical columns, the summary contains the following for each column:

- count: the number of non-empty values.
- mean: the average (mean) value.
- std: the standard deviation.
- min: the minimum value.
- 25%: the 25th percentile.
- 50%: the 50th percentile (the median).
- 75%: the 75th percentile.
- max: the maximum value.
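
As a small sanity check of what these rows mean, the sketch below compares `describe()` against the individual pandas methods on a made-up series (`values` is illustrative, not Iris data).

```python
import pandas as pd

# Made-up numbers, chosen so the percentiles are easy to check by hand
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="toy")

summary = values.describe()

# Each row of the summary matches the corresponding Series method
print(summary["count"], values.count())
print(summary["mean"], values.mean())
print(summary["std"], values.std())          # sample standard deviation
print(summary["25%"], values.quantile(0.25))
print(summary["50%"], values.median())
print(summary["75%"], values.quantile(0.75))
```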

In [None]:
df.describe()

Pandas is a popular data manipulation library in Python that provides a variety of functions for data analysis. Pandas can create several types of plots such as line plots, bar plots, scatter plots, histograms, box plots, pie charts, and area charts.

You can use the `plot()` method in Pandas to create these plots. The `plot()` method is a wrapper around the Matplotlib library and provides a simple interface for creating plots.

Here are some examples of how to create different types of plots using Pandas:

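- Line plot: `df.plot(kind='line')`
- Bar plot: `df.plot(kind='bar')`
- Scatter plot: `df.plot(kind='scatter', x='Sepal Length', y='Sepal Width')` (a scatter plot needs an `x` and a `y` column)
- Histogram: `df.plot(kind='hist')`
- Box plot: `df.plot(kind='box')`
- Pie chart: `df['Species'].value_counts().plot(kind='pie')` (a pie chart needs a single column of values)
- Area chart: `df.plot(kind='area')`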
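
As a minimal sketch of the `plot()` interface, assuming Matplotlib is installed, the snippet below draws a scatter plot from a small made-up frame (`toy` and its values are illustrative). Scatter plots need the `x` and `y` columns named explicitly.

```python
import pandas as pd

# Made-up measurements for illustration
toy = pd.DataFrame({
    "Petal Length": [1.4, 1.3, 4.7, 4.5, 6.0],
    "Petal Width": [0.2, 0.2, 1.4, 1.5, 2.5],
})

# plot() returns a Matplotlib Axes object that can be customised further
ax = toy.plot(kind="scatter", x="Petal Length", y="Petal Width")
ax.set_title("Petal width against petal length")
```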


In [None]:
df.plot(kind='box')

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Seaborn helps you explore and understand your data. It is built on top of matplotlib and integrates closely with pandas data structures. You can use seaborn to visualize data from pandas dataframes or numpy arrays.



In [None]:
import seaborn as sb
# Pair Plot
sb.pairplot(df, hue='Species')

This is a piece of Python code that uses the Seaborn library to create a pair plot. Let's break it down:

1. `import seaborn as sb`: This line imports the Seaborn library and gives it the alias `sb`.

2. `sb.pairplot(df, hue='Species')`: This line creates a pair plot of the DataFrame `df` using the `pairplot` function from Seaborn. 

A pair plot is a matrix of scatterplots where each feature in the dataset is compared with every other feature. It's a quick way to visualize the relationships between each variable in your data. The diagonal of this matrix is filled with histograms or KDE plots of each variable, and the other cells contain scatter plots of the variable combinations.

The `hue` parameter is used to group the data by some categorical variable. In this case, it's grouping by the 'Species' column in the DataFrame. This means that data points will be colored based on their 'Species' value, which can help in distinguishing between different categories in the data.

So, in summary, this code is creating a pair plot of the data in the DataFrame `df`, with data points colored based on their 'Species' value.


Looks like there is some separation between species and some correlation in the data. Let's investigate a little more.
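
One way to put numbers on the separation seen in the pair plot is a per-species average. The sketch below uses a handful of made-up rows (`toy` is illustrative); on the notebook's data the equivalent call would be `df.groupby('Species').mean()`.

```python
import pandas as pd

# Made-up rows: two per species
toy = pd.DataFrame({
    "Petal Length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "Species": [
        "Iris-setosa", "Iris-setosa",
        "Iris-versicolor", "Iris-versicolor",
        "Iris-virginica", "Iris-virginica",
    ],
})

# Group the rows by species and average the measurement within each group
means = toy.groupby("Species")["Petal Length"].mean()
print(means)
```

Group means that sit far apart relative to the spread within each group suggest that the feature helps tell the species apart.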

In [None]:
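# Correlate the numeric columns only; pandas 2.0+ raises an error
# if the string-valued 'Species' column is included
df.drop(columns='Species').corr()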

In [None]:
# Drop the string-valued 'Species' column before computing correlations
sb.heatmap(df.drop(columns='Species').corr(), annot=True)
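
For reference, each cell of the correlation matrix is a Pearson correlation coefficient, which is the default method for `corr()`. The sketch below recomputes one coefficient by hand on made-up numbers (`x` and `y` are illustrative) and compares it with the pandas result.

```python
import pandas as pd

# Made-up, nearly linear data
x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.1, 5.9, 8.2])

# Sample covariance: average product of deviations from the means (n - 1 in the denominator)
cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# Pearson r: covariance scaled by both sample standard deviations
r_manual = cov / (x.std() * y.std())

# pandas computes the same quantity
r_pandas = x.corr(y)

print(r_manual, r_pandas)
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.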