# Assignment 1

## Part 1 - Setting up the working environment

In this assignment you will lay the foundations for the coming weeks, which includes setting up a GitHub repository and exploring some data.

### 1. Setup

Make sure you have a working installation of Anaconda, Jupyter Notebook and Git, as well as a GitHub account.

### 2. Creating your Notebook and connecting it to GitHub

- Create a GitHub repository for the course and name it Py3b_firstname_lastname, replacing firstname and lastname with your own name.

- Clone the repository to your computer somewhere inside the working directory of Jupyter Notebook.

- Inside the repository you just cloned, create five folders, one for each week. Name them week1, week2, week3, week4 and week5.

- Move this notebook to the week1 folder before moving on.

- Add the teacher(s) as collaborators on your project.

- For assignment 1, hand in a link to your GitHub repo. Submit only the link, do NOT upload the entire folder.

- After completing part 2 of this assignment, push the changes you have made in the Py3b_firstname_lastname folder to GitHub to update the repository.
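
The folder step above can also be done from Python. The following is a minimal sketch, not part of the required workflow: `create_week_folders` is a hypothetical helper name, and the `.gitkeep` placeholder file is only a common convention (an assumption here), used because Git does not track empty directories.

```python
from pathlib import Path
import tempfile


def create_week_folders(repo_root, n_weeks=5):
    """Create week1 .. week<n_weeks> inside repo_root and return the paths."""
    root = Path(repo_root)
    created = []
    for week in range(1, n_weeks + 1):
        folder = root / f"week{week}"
        folder.mkdir(parents=True, exist_ok=True)
        # Git does not track empty directories, so add a placeholder file.
        (folder / ".gitkeep").touch()
        created.append(folder)
    return created


# Demo in a throwaway directory; in practice, pass the path of your cloned repo.
with tempfile.TemporaryDirectory() as tmp:
    folders = create_week_folders(tmp)
    folder_names = [folder.name for folder in folders]
    print(folder_names)
```

Because `exist_ok=True` is set, running the helper a second time on the same repository leaves existing folders untouched.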

## Part 2 - Exploring the Iris dataset

The Iris dataset is a classic dataset used in machine learning and statistics. It is a multivariate dataset that contains measurements of the sepal length, sepal width, petal length, and petal width of three different species of iris flowers: Iris setosa, Iris versicolor, and Iris virginica. The dataset is commonly used to study and develop classification algorithms, as the goal is to accurately predict the species of iris flower based on the four features. The Iris dataset is also widely used in data visualization and exploratory data analysis due to its simple and well-defined structure.

The dataset includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.

The columns in this dataset are: Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species.

The Iris Species dataset can be [found here](https://www.kaggle.com/datasets/uciml/iris).

In this assignment you will explore the data in this dataset to familiarize yourself with it before moving on to training a Machine Learning model on it.

### 1. Import necessary libraries & setup

First off we need to import the Pandas and Matplotlib libraries and set the default figure size for Matplotlib plots.

The first two lines below will import the Pandas and Matplotlib libraries into the environment.

The third line uses the 'rcParams' attribute of the 'plt' object to set the default figure size for Matplotlib plots. The 'rcParams' attribute is a dictionary-like object that contains default settings for various Matplotlib parameters. In this case, we set the 'figure.figsize' parameter to [7, 5], which means that the default figure size for Matplotlib plots will be 7 inches wide and 5 inches tall.

By setting the default figure size, we can ensure that all plots created using Matplotlib in the current session will have the same size and aspect ratio, making it easier to compare and interpret the results.
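
The setup cell described above is not reproduced in this text. A minimal sketch of the three lines it refers to, assuming the conventional aliases `pd` and `plt`, might look like this:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Default size (width, height) in inches for every figure created afterwards.
plt.rcParams['figure.figsize'] = [7, 5]

print(plt.rcParams['figure.figsize'])
```

The setting only affects figures created after the assignment, and an individual plot can still override it with its own `figsize` argument.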

### 2. Read and explore data

You can load the dataset to a DataFrame to be processed or explored by downloading the CSV file to the same directory as this notebook and using <code>df = pd.read_csv('name_of_your_file.csv')</code>.

<code>df.head()</code> is a method used to view the first n rows of a DataFrame. By default, the method displays the first five rows of the DataFrame. However, the user can pass a number n as an argument to the method to view the first n rows of the DataFrame. <code>df.head(10)</code> would display the first ten rows of the DataFrame.

The method is useful for quickly inspecting the data and getting a sense of its structure, column names, and data types. It is often used as a first step in exploratory data analysis to understand what data the DataFrame contains and how it is structured.
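
To try `read_csv` and `head` without the downloaded file, the sketch below reads CSV text from memory instead of from disk. The column names are the ones listed for the dataset; the three rows are placeholders for illustration and are not guaranteed to match the real file, and `demo_df` is a hypothetical name.

```python
import io

import pandas as pd

# Placeholder CSV text using the dataset's column names; rows are illustrative only.
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
"""

# read_csv accepts a file path or any file-like object.
demo_df = pd.read_csv(io.StringIO(csv_text))

first_two = demo_df.head(2)    # only the first two rows
default_head = demo_df.head()  # up to five rows; here that is all three

print(first_two)
```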

Explore the data in the dataset by loading it with <code>df = pd.read_csv(...)</code> and then using <code>df.head()</code> and at least three of the following methods:

<ul>
    <li><code>df.info()</code> - provides a summary of the DataFrame, including the number of non-null values, data types, and memory usage.</li>
    <li><code>df.describe()</code> - generates descriptive statistics of the DataFrame, such as count, mean, standard deviation, minimum, maximum, and quartiles.</li>
    <li><code>df.shape</code> - returns a tuple of the number of rows and columns in the DataFrame.</li>
    <li><code>df.columns</code> - returns an Index object holding the column names of the DataFrame.</li>
    <li><code>df.dtypes</code> - returns a Series with the data type of each column.</li>
    <li><code>df.isnull().sum()</code> - returns the number of missing values in each column of the DataFrame.</li>
    <li><code>df.nunique()</code> - returns the number of unique values in each column of the DataFrame.</li>
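    <li><code>df.value_counts()</code> - returns a Series containing the counts of unique rows in the DataFrame. To count the values of a single column, call it on that column, for example <code>df['Species'].value_counts()</code>.</li>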
    <li><code>df.sample()</code> - returns a random sample of rows from the DataFrame.</li>
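    <li><code>df.corr()</code> - computes the pairwise correlation of the numeric columns in the DataFrame. In recent pandas versions, pass <code>numeric_only=True</code> when the DataFrame also holds text columns such as Species.</li>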
</ul>
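
Two of the methods in the list behave differently on a whole DataFrame than on a single column. The sketch below illustrates this on a small hand-made DataFrame; the values are made up for the example and are not taken from the Iris file, and `toy` is a hypothetical name.

```python
import pandas as pd

# Made-up values with two of the dataset's column names plus Species.
toy = pd.DataFrame({
    "SepalLengthCm": [5.0, 5.0, 6.0, 7.0],
    "SepalWidthCm": [3.0, 3.0, 2.5, 3.2],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# On a DataFrame, value_counts() counts identical rows.
row_counts = toy.value_counts()

# On a single column (a Series), it counts each distinct value.
species_counts = toy["Species"].value_counts()

# numeric_only=True skips the text column instead of failing on it.
corr = toy.corr(numeric_only=True)

print(toy.shape)
print(species_counts)
print(corr)
```

The `numeric_only` argument assumes a reasonably recent pandas release; on older versions, selecting the numeric columns first gives the same result.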

In [None]:
# Enter your code here
# Importing necessary libraries
import pandas as pd

# Load the dataset
df = pd.read_csv(r"D:\arbets kurs\as6-iris-data.csv")

# Display the first 5 rows of the DataFrame
print("First 5 rows of the DataFrame:")
print(df.head())

# Explore the data using different methods

# 1. Get a summary of the DataFrame
print("\nDataFrame Info:")
df.info()  # info() prints its summary itself and returns None, so no print() is needed

# 2. Generate descriptive statistics
print("\nDescriptive Statistics:")
print(df.describe())

# 3. Check for missing values
print("\nMissing Values in Each Column:")
print(df.isnull().sum())

# 4. Check the shape of the DataFrame
print("\nShape of the DataFrame:")
print(df.shape)

# 5. Display column names
print("\nColumn Names:")
print(df.columns)

# 6. Check the data types of each column
print("\nData Types of Each Column:")
print(df.dtypes)

# 7. Check the number of unique values in each column
print("\nUnique Values in Each Column:")
print(df.nunique())

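# 8. Display the pairwise correlation of the numeric columns
# (numeric_only=True skips the text column Species, which would otherwise raise an error in recent pandas)
print("\nCorrelation Matrix:")
print(df.corr(numeric_only=True))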

# 9. Get a random sample of rows from the DataFrame
print("\nRandom Sample of Rows:")
print(df.sample(5))


### 3. Plotting the data

1. Define a dictionary called `species_dict` that maps the three species in the "Species" column to numerical values. The key-value pairs in the dictionary should be as follows: 

```python
'Iris-setosa': 0
'Iris-versicolor': 1
'Iris-virginica': 2
```

2. Use the map() method of the "Species" column (a Pandas Series) to create a new column called "SpeciesNum" that contains the numerical values for the "Species" column. The mapping should be done using the species_dict dictionary created in step 1. The code for this step should be as follows:

```python
df['SpeciesNum'] = df['Species'].map(species_dict)
```

3. Create a scatter plot of the data using the plot() method of the Pandas DataFrame. Set the kind parameter to 'scatter', the x parameter to 'SepalLengthCm', the y parameter to 'SepalWidthCm', the c parameter to 'SpeciesNum', and the cmap parameter to 'viridis'. 

If you have plotted your data correctly the result should look something like this:

![image-2.png](attachment:image-2.png)
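
Before plotting, it can be worth checking that the mapping from steps 1 and 2 matched every label, because `map()` silently turns any value missing from the dictionary into NaN. A minimal sketch of such a check, using made-up labels in which one is deliberately misspelled (the names `species`, `species_num` and `unmapped` are hypothetical):

```python
import pandas as pd

species_dict = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}

# Made-up labels; the last one is deliberately misspelled (lowercase "i").
species = pd.Series(['Iris-setosa', 'Iris-virginica', 'iris-versicolor'], name='Species')

# Labels that are not keys of the dictionary become NaN.
species_num = species.map(species_dict)

# Collect the labels that failed to map so they can be inspected.
unmapped = species[species_num.isna()].tolist()
print(unmapped)
```

On the real DataFrame the same check would be `df['SpeciesNum'].isna().sum()`, which should be 0 if all three species names match the dictionary keys.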

In [None]:
# Enter your code here
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('D:\\arbets kurs\\as6-iris-data.csv')  # Make sure to adjust the path

# Step 1: Define the species dictionary
species_dict = {
    'Iris-setosa': 0,
    'Iris-versicolor': 1,
    'Iris-virginica': 2
}

# Step 2: Map species to numerical values
df['SpeciesNum'] = df['Species'].map(species_dict)

# Step 3: Create a scatter plot
# pandas draws the colorbar itself when c is a numeric column, so no extra plt.colorbar() call is needed
ax = df.plot(kind='scatter', x='SepalLengthCm', y='SepalWidthCm', c='SpeciesNum', cmap='viridis')
ax.set_title('Sepal Length vs Sepal Width')
ax.set_xlabel('Sepal Length (cm)')
ax.set_ylabel('Sepal Width (cm)')
ax.grid(True)
plt.show()


## Complete!

If your plot looks like the one above you're done!

Submit your work by pushing the changes to GitHub, inviting the teacher(s) to your repository and submitting the link under the assignment.