<a href="https://colab.research.google.com/github/ivanozono/DescriptiveStatistics/blob/main/(1)Types_of_data_in_descriptive_statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Types of data in descriptive statistics

---

In statistics, the first step is to recognize what type of data you are working with. Data can be qualitative or quantitative, and each type calls for its own methods of analysis.

---

## Qualitative data
These are data that describe characteristics or qualities that cannot be measured with numbers. Examples: eye color, type of housing, car brand.

## Quantitative data
These are data that are measured or counted with numbers. They can be of two types: discrete and continuous. Discrete data take separate, countable values (example: number of children), while continuous data can take any value within a range (example: weight, height).
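As a rough illustration of these categories, the sketch below labels the columns of a small, hypothetical table by their pandas dtype. The `people` data and the `classify` helper are made up for this example, and treating integer columns as discrete and float columns as continuous is only a heuristic, not a rule pandas enforces.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical toy data for illustration only
people = pd.DataFrame({
    "eye_color": ["brown", "blue", "green", "brown"],
    "num_children": [0, 2, 1, 3],
    "height_cm": [172.5, 160.2, 181.0, 168.4],
})


def classify(column: pd.Series) -> str:
    """Heuristic: map a column's dtype to a statistical data type label."""
    if not ptypes.is_numeric_dtype(column):
        return "qualitative"
    if ptypes.is_integer_dtype(column):
        return "quantitative (discrete)"
    return "quantitative (continuous)"


labels = {name: classify(people[name]) for name in people.columns}
for name, label in labels.items():
    print(f"{name}: {label}")
```

The heuristic can mislead: a numeric ID or postal code stored as integers is still qualitative, so the dtype should be read alongside what the column means.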

In this notebook, we will explore these concepts using the pandas library in Python and the Iris dataset, loaded from the seaborn-data repository on GitHub.



---

**Loading and Displaying the Iris Dataset**

---

In [2]:
# Importing necessary libraries
import pandas as pd

# Loading the iris dataset from the seaborn-data repository on GitHub
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

# Displaying the first few rows of the dataset
iris.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |




The purpose of this code is to import necessary libraries, load the well-known iris dataset from an online source, and then display its first few rows.

---

**Code Explanation:**

1. **Importing the Required Library:**
   
    The `pandas` library, which is frequently used for data manipulation and analysis in Python, is imported.

2. **Loading the Dataset:**
  
    Using the `read_csv()` function from the `pandas` library, the iris dataset is loaded directly from a raw GitHub link. This function reads a comma-separated values (csv) file into a DataFrame.

3. **Displaying the Initial Dataset Entries:**
   
    The `head()` method returns the first five rows of the DataFrame by default; pass an integer, as in `head(10)`, to see a different number of rows. This is useful to get a quick glimpse of the dataset's structure and the type of data it contains.
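The same two steps can be tried without a network connection. The sketch below is a minimal variation, assuming the five rows displayed above are pasted into an in-memory string; `csv_text`, `sample`, and `first_two` are names chosen for this example.

```python
import io

import pandas as pd

# The five rows displayed above, stored as in-memory CSV text
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
"""

# read_csv accepts a file-like object as well as a path or URL
sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows as a new DataFrame
first_two = sample.head(2)
print(first_two)
```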







The dataset we are using is the famous Iris dataset. It contains measurements for 150 iris flowers from three different species.

The three classes in the Iris dataset are:

1. Iris-setosa (n=50)
2. Iris-versicolor (n=50)
3. Iris-virginica (n=50)

The four features in the Iris dataset are:

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm

Let's explore the data a bit more.

---

**Inspecting the Shape of the Iris Dataset**

---

In [3]:
# Checking the shape of the dataset
iris.shape

(150, 5)



The purpose of this code is to determine the number of rows and columns in the iris dataset.

---

**Code Explanation:**

1. **Determining the Dataset's Shape:**
  
    The `shape` attribute of a DataFrame returns a tuple representing the dimensions of the DataFrame. Specifically, it provides the number of rows and columns.



Understanding the size and shape of your dataset is fundamental in data analysis. This initial check provides insight into the volume of data you're working with and can inform decisions on data splitting, sampling, or the applicability of certain analytical methods.

The dataset contains 150 rows and 5 columns. Each row corresponds to a single flower. Four of the columns hold the flower's measurements, and the fifth holds its species.
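Because `shape` is an ordinary tuple, it can be unpacked into separate row and column counts. The sketch below uses a hypothetical three-row frame named `flowers` rather than the full dataset.

```python
import pandas as pd

# Hypothetical toy frame: 3 rows, 2 columns
flowers = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})

# shape is a (rows, columns) tuple, so it unpacks directly
n_rows, n_cols = flowers.shape
print(f"{n_rows} rows, {n_cols} columns")

# len() on a DataFrame counts rows only
row_count = len(flowers)
```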

Let's check the data types of the columns.

---

**Inspecting Data Types of the Iris Dataset Columns**

---

In [4]:
# Checking the data types of the columns
iris.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object



The aim of this code snippet is to ascertain the data types of each column in the iris dataset.

---

**Code Explanation:**

1. **Evaluating the Data Types of Columns:**
   
    Using the `dtypes` attribute of a DataFrame, you can obtain the data types of each column. This attribute returns a Series with the data type of each column. It helps to ensure that each column's data is stored in an appropriate format, which is crucial for subsequent data processing or analysis tasks.



Understanding column data types is essential in data analysis. It can influence decisions on data processing, visualization, and modeling. For instance, you wouldn't compute a mean on a text column, and you'd handle categorical data differently than continuous numerical data in many machine learning algorithms.

The four measurement columns (sepal length, sepal width, petal length, petal width) are all of type `float64`, and the `species` column is of type `object`, which pandas uses here for string or text data.

This means that our dataset contains both quantitative data (the four measurements, which are continuous) and qualitative data (the species).
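One way to act on this split is to separate the columns by dtype before computing anything. The sketch below is a minimal example on a reduced frame (the sepal lengths and species of the five rows displayed earlier); `sample`, `quantitative`, and `qualitative` are names chosen for this example.

```python
import pandas as pd

# Sepal length and species of the five rows displayed earlier
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "species": ["setosa"] * 5,
})

# Split the columns by dtype
quantitative = sample.select_dtypes(include="number")
qualitative = sample.select_dtypes(exclude="number")

# numeric_only=True restricts the mean to the numeric columns
means = sample.mean(numeric_only=True)
print(means)
```

Restricting the calculation explicitly avoids depending on how a given pandas version treats text columns in numeric reductions.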

Let's do some basic data analysis.

---

**Analyzing Basic Statistics of the Iris Dataset**

---

In [5]:
# Checking the basic statistics of the quantitative data
iris.describe()

|   | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |




The objective of this code segment is to get a summary of the basic statistical measures for the quantitative (numeric) columns in the iris dataset.

---

**Code Explanation:**

1. **Computing Basic Statistical Measures:**
   
    The `describe()` method of a DataFrame returns a summary of the central tendency, dispersion, and shape of the distribution of a dataset’s numerical columns. By default, it excludes columns of type `object`.

    - `count`: Number of non-missing values for each column.
    - `mean`: The average value.
    - `std`: Standard deviation, which measures the amount of variation or dispersion of a set of values.
    - `min`: Minimum value in the column.
    - `25%`, `50%`, `75%`: The 25th (first quartile), 50th (median), and 75th (third quartile) percentiles, respectively.
    - `max`: Maximum value in the column.
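To make the entries above concrete, the sketch below recomputes each one by hand for a small, hypothetical Series and places the results next to the `describe()` output. The `values` data and the `manual` Series are made up for this example.

```python
import pandas as pd

# Hypothetical toy data
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="toy")

summary = values.describe()

# The same measures computed one by one
manual = pd.Series({
    "count": float(values.count()),
    "mean": values.mean(),
    "std": values.std(),  # sample standard deviation (ddof=1 by default)
    "min": values.min(),
    "25%": values.quantile(0.25),
    "50%": values.quantile(0.50),  # the median
    "75%": values.quantile(0.75),
    "max": values.max(),
})

print(pd.DataFrame({"describe": summary, "manual": manual}))
```

Note that `std` is the sample standard deviation, which divides by n - 1 rather than n.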



Having a snapshot of these basic statistics is beneficial for any data analysis task. It offers an initial feel for the data, helps flag potential outliers (a minimum or maximum that lies far from the quartiles deserves a closer look), and informs decisions about data normalization or standardization.
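One common way to turn the comparison of extremes and quartiles into a rule is the 1.5 × IQR convention. The notebook does not use this rule; it is shown here only as one possible approach, on hypothetical data.

```python
import pandas as pd

# Hypothetical data with one extreme value
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1  # interquartile range

# Conventional fences: 1.5 * IQR beyond the quartiles (a rule of thumb)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers.tolist())
```

A value flagged this way is a candidate for inspection, not necessarily an error.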



Now, let's check the distribution of the qualitative data, i.e., the species.

---

**Analyzing Distribution of Species in the Iris Dataset**

---

In [7]:
# Checking the distribution of the species
iris['species'].value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64



The aim of this code segment is to examine the frequency distribution of the different species in the iris dataset.

---

**Code Explanation:**

1. **Evaluating Species Distribution:**
   
    The `value_counts()` method is applied on a pandas Series to get a tally of unique values. In this case, we are applying it on the `species` column of the iris dataset to understand the distribution of different species.

    The result will show how many samples of each species are present in the dataset. The species name will be the index, and the count of samples will be the corresponding value.



Understanding the distribution of categorical data, such as species in this instance, is crucial in various data analysis contexts. It helps ensure balanced samples, especially in machine learning, where having uneven samples might introduce bias in model training and evaluation.

The dataset is balanced, meaning there are equal numbers of samples from each species (50 each of setosa, versicolor, and virginica).
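For contrast, the sketch below applies the same check to a hypothetical, deliberately unbalanced set of labels. It also uses `normalize=True` to report proportions instead of raw counts. The `species` Series and the `is_balanced` flag are made up for this example.

```python
import pandas as pd

# Hypothetical, deliberately unbalanced labels
species = pd.Series(
    ["setosa"] * 5 + ["versicolor"] * 3 + ["virginica"] * 2,
    name="species",
)

counts = species.value_counts()                # absolute frequencies
shares = species.value_counts(normalize=True)  # relative frequencies

# Balanced means every class has the same count
is_balanced = counts.nunique() == 1

print(counts)
print(shares)
print("balanced:", is_balanced)
```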

---

This concludes our basic exploration of the dataset. We've seen that it contains both qualitative and quantitative data, and we've examined summary statistics for the quantitative columns and frequency counts for the qualitative one.