
# Khipus.ai
## Applied Statistics with Python
### Descriptive Statistics
### Case Study - Airbnb dataset

<span>© Copyright Notice 2025, Khipus.ai - All Rights Reserved.</span>



### Introduction
In this notebook, we will explore descriptive statistics using the `airnb.csv` dataset. We will cover:
- Frequency distribution tables
- Visualizations: Bar charts, Pie charts, Histograms, Scatter plots, Boxplots
- Measures of Central Tendency: Mean, Median, Mode
- Dispersion Metrics: Range, Variance, Standard Deviation, Percentiles
- Correlation


In [62]:
import numpy as np  # Import the numpy library for numerical operations
import pandas as pd  # Import the pandas library for data manipulation and analysis
import matplotlib.pyplot as plt  # Import the matplotlib library for data visualization

Note: The Airbnb dataset contains information about Airbnb listings and bookings. Datasets like this are widely used in data analysis and machine learning projects to study trends, pricing strategies, and user behavior. This notebook works with three of its columns: `Number of bed`, `Price(in dollar)`, and `Offer price(in dollar)`.

In [None]:
# Load the dataset
data = pd.read_csv('airnb.csv')  # Read the dataset into a pandas DataFrame
data.head()  # Show the first 5 rows of the dataset to get an overview

### Frequency Distribution Tables

In [None]:

# Example: Frequency distribution of 'Number of bed'
frequency_table = data['Number of bed'].value_counts()  # Calculate the frequency of each unique value in the 'Number of bed' column
frequency_table  # Display the frequency table
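Raw counts are often easier to compare as proportions. The following is a minimal sketch, assuming a small made-up Series in place of the real `Number of bed` column (the values are illustrative and not taken from `airnb.csv`). It extends the frequency table with relative and cumulative frequencies using `value_counts(normalize=True)`:

```python
import pandas as pd

# Toy values standing in for data['Number of bed'] (illustrative only)
beds = pd.Series([1, 2, 1, 3, 1, 2, 1, 4])

counts = beds.value_counts()                  # absolute frequency of each value
relative = beds.value_counts(normalize=True)  # share of rows for each value

# Combine the columns into one frequency distribution table
freq_table = counts.to_frame(name='count')
freq_table['relative'] = relative
freq_table['cumulative'] = relative.cumsum()  # running total of the shares

print(freq_table)
```

The same calls should work on `data['Number of bed']` directly; the cumulative column always ends at 1.0.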


### Bar Charts

In [None]:
# Bar chart for 'Number of bed'
frequency_table.plot(kind='bar', title='Frequency of Number of Beds')  # Create a bar chart for the frequency table
plt.xlabel('Number of Beds')  # Set the x-axis label
plt.ylabel('Frequency')  # Set the y-axis label
plt.show()  # Display the bar chart



### Pie Charts

In [None]:

# Pie chart for 'Number of bed'
frequency_table.plot(kind='pie', autopct='%1.1f%%', title='Number of Beds Distribution')  # Create a pie chart for the frequency table with percentage labels
plt.ylabel('')  # Remove y-label for better presentation
plt.show()  # Display the pie chart



### Histograms

In [None]:
# Define the bin edges in dollars; prices above 600 (the maximum is 955) fall outside the bins and are not drawn
bins = [0, 100, 200, 300, 400, 500, 600]

# Plot the histogram
data['Price(in dollar)'].hist(bins=bins, edgecolor='black')

# Set the title and labels
plt.title('Distribution of Airbnb Prices')
plt.xlabel('Price (in dollar)')
plt.ylabel('Frequency')

# Display the histogram
plt.show()
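The same bin edges can also produce a grouped frequency table. This is a minimal sketch, assuming a handful of made-up prices (not taken from the dataset), that uses `pd.cut` to assign each price to a bin and counts how many prices fall outside the bins altogether:

```python
import pandas as pd

# Toy prices (illustrative only); 955 lies beyond the last bin edge
prices = pd.Series([45, 90, 138, 150, 222, 310, 480, 955])
bins = [0, 100, 200, 300, 400, 500, 600]

binned = pd.cut(prices, bins=bins)         # right-closed intervals: (0, 100], (100, 200], ...
grouped = binned.value_counts(sort=False)  # keep the bins in ascending order
outside = binned.isna().sum()              # prices not covered by any bin

print(grouped)
print('Prices outside the bins:', outside)
```

Note that `pd.cut` uses right-closed intervals by default, whereas the histogram's bins are left-closed (except the last one), so a value lying exactly on an edge can land in different bins in the two views.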

### Scatter plot

In [None]:
# Scatter plot for Price and Offer price
# alpha sets the transparency of the points (0 = fully transparent, 1 = fully opaque);
# 0.5 makes overlapping points easier to see
plt.scatter(data['Price(in dollar)'], data['Offer price(in dollar)'], alpha=0.5)
plt.title('Scatter Plot of Price vs Offer Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Offer Price (in dollars)')
plt.show()

### Boxplot

In [None]:
# Generate a boxplot of the 'Price(in dollar)' column
plt.figure(figsize=(10, 6))  # Set the figure size
plt.boxplot(data['Price(in dollar)'].dropna(), vert=False)  # Create a horizontal boxplot, dropping any NaN values
plt.title('Boxplot of Airbnb Prices')  # Set the title of the plot
plt.xlabel('Price (in dollars)')  # Set the x-axis label
plt.show()  # Display the boxplot

### Measures of Central Tendency

In [None]:
# Calculate mean, median, and mode for the Price (in dollars) column
mean_price = data['Price(in dollar)'].mean()
median_price = data['Price(in dollar)'].median()
mode_price = data['Price(in dollar)'].mode()[0]  # [0] is used to extract the first element from the result of the mode() function

mean_price, median_price, mode_price

Mean: 175.26 (average price)

Median: 138.0 (middle value)

Mode: 111 (most frequent price)
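The mean sits well above the median here, which is the pattern a few very expensive listings would produce. A small sketch with made-up prices (illustrative only, not rows from the dataset) shows how a single extreme value pulls the mean upward while the median and mode barely move:

```python
import pandas as pd

# Toy prices without and with one extreme listing (illustrative only)
typical = pd.Series([90, 111, 111, 138, 150])
with_outlier = pd.concat([typical, pd.Series([955])], ignore_index=True)

for label, s in [('typical', typical), ('with outlier', with_outlier)]:
    # mode() returns a Series because several values can tie; [0] takes the first
    print(label, 'mean:', round(s.mean(), 2), 'median:', s.median(), 'mode:', s.mode()[0])
```

For skewed data such as prices, the median is usually the more representative "typical" value.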

### Generate Summary Statistics
Use pandas to generate a summary statistics table for the dataset.

In [9]:
# Generate Summary Statistics

# Generate a summary statistics table for the dataset
summary_statistics = data.describe()
summary_statistics

| Statistic | Price(in dollar) | Offer price(in dollar) |
|---|---|---|
| count | 538.0 | 95.0 |
| mean | 175.256506 | 150.094737 |
| std | 136.357118 | 111.180711 |
| min | 16.0 | 16.0 |
| 25% | 90.0 | 73.0 |
| 50% | 138.0 | 132.0 |
| 75% | 222.0 | 179.5 |
| max | 955.0 | 610.0 |


## Measures of Variability

### Range

Range = Maximum Value − Minimum Value

In [12]:
# Calculate the range for 'Price(in dollar)'
range_price = data['Price(in dollar)'].max() - data['Price(in dollar)'].min()
range_price

939

### Percentiles

Common Percentiles:
- 25th Percentile (Q1): Lower quartile
- 50th Percentile (Q2): Median
- 75th Percentile (Q3): Upper quartile

In [11]:
# Calculate the percentiles for the 'Price(in dollar)' column
percentiles = data['Price(in dollar)'].quantile([0.25, 0.5, 0.75])
percentiles

0.25     90.0
0.50    138.0
0.75    222.0
Name: Price(in dollar), dtype: float64
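The quartiles also give the interquartile range (IQR = Q3 − Q1), a dispersion measure that is less sensitive to extreme values than the range. As a worked sketch using the Q1 and Q3 values printed above, and assuming the common 1.5 × IQR convention for flagging outliers (which is also the default whisker length in matplotlib's `boxplot`):

```python
# Quartiles of 'Price(in dollar)' as reported in the output above
q1, q3 = 90.0, 222.0

iqr = q3 - q1                  # spread of the middle 50% of prices
lower_fence = q1 - 1.5 * iqr   # values below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr   # values above this are flagged as outliers

print('IQR:', iqr)
print('Fences:', lower_fence, upper_fence)
```

Under this convention, any price above 420 would be drawn as an individual outlier point in the boxplot; the lower fence is negative, so no low price is flagged.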

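### Variance
The average squared deviation of the data from the mean. The pandas `var()` method computes the sample variance by default, dividing by n − 1 rather than n.

### Standard deviation
The square root of the variance, expressed in the same units as the data. It indicates the typical distance of a value from the mean.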

In [10]:
### Dispersion Metrics

# Variance and Standard Deviation
variance = data['Price(in dollar)'].var()  # Calculate the variance of the 'Price(in dollar)' column
std_dev = data['Price(in dollar)'].std()  # Calculate the standard deviation of the 'Price(in dollar)' column

variance, std_dev  # Display the variance and standard deviation



(18593.26369130443, 136.3571182274854)
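The two numbers above are linked: 136.357² ≈ 18593.26. To make the formulas concrete, here is a minimal sketch on made-up values (illustrative only) that compares the pandas defaults, which use `ddof=1` (sample statistics), with the population versions and with a manual calculation:

```python
import numpy as np
import pandas as pd

# Toy prices (illustrative only)
values = pd.Series([90, 111, 138, 222, 955])
n = len(values)

sample_var = values.var()            # pandas default: ddof=1, divides by n - 1
population_var = values.var(ddof=0)  # divides by n (same as NumPy's default)

# Manual sample variance: sum of squared deviations from the mean over n - 1
manual_var = ((values - values.mean()) ** 2).sum() / (n - 1)

print('Sample variance:    ', sample_var)
print('Population variance:', population_var)
print('Manual calculation: ', manual_var)
print('Standard deviation: ', values.std(), '=', np.sqrt(sample_var))
```

The gap between the two variances shrinks as n grows, so for the 538 prices in this notebook the choice of `ddof` makes little practical difference.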

### Standard Deviation Visualization

In [None]:
# Create a range of values for visualization
x = np.linspace(mean_price - 3 * std_dev, mean_price + 3 * std_dev, 1000)
y = (1 / (std_dev * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean_price) / std_dev)**2)

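# Plot a normal curve with the same mean and standard deviation as the prices
# (illustrative only: the actual prices are right-skewed, and the curve extends below zero)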
plt.figure(figsize=(10, 6))
plt.plot(x, y, label='Normal Distribution', color='blue')
plt.axvline(mean_price, color='green', linestyle='--', label='Mean')
plt.axvline(mean_price - std_dev, color='red', linestyle='--', label='1 Std Dev Below Mean')
plt.axvline(mean_price + std_dev, color='red', linestyle='--', label='1 Std Dev Above Mean')

# Add labels and legend
plt.title('Visualization of Standard Deviation', fontsize=16)
plt.xlabel('Price (in dollars)', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.legend()
plt.grid(True)
plt.show()

### Correlation

In [None]:
# Calculate the correlation matrix between 'Number of bed' and 'Price(in dollar)'
# corr() computes the Pearson correlation by default and works on numeric columns
correlation_beds_price = data[['Number of bed', 'Price(in dollar)']].corr()
correlation_beds_price
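The summary statistics table earlier lists only the two price columns, which suggests that `Number of bed` may be stored as text rather than as numbers. If so, it would need converting before a correlation can be computed. The sketch below assumes a hypothetical text format such as `'2 beds'` (the real format in `airnb.csv` may differ) and uses made-up rows:

```python
import pandas as pd

# Toy rows with a hypothetical text format for the bed count (illustrative only)
toy = pd.DataFrame({
    'Number of bed': ['1 bed', '2 beds', '3 beds', '4 beds', 'Studio'],
    'Price(in dollar)': [90, 138, 222, 300, 111],
})

# Pull out the first run of digits and convert it; entries without digits become NaN
beds_numeric = pd.to_numeric(
    toy['Number of bed'].str.extract(r'(\d+)', expand=False),
    errors='coerce',
)

# Series.corr computes the Pearson coefficient and skips rows where either value is NaN
r = beds_numeric.corr(toy['Price(in dollar)'])
print('Pearson r:', round(r, 3))
```

A coefficient close to +1 indicates that price tends to rise with the number of beds, close to −1 that it tends to fall, and close to 0 that there is little linear relationship.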