# **EDA - Apple AppStore Apps Analysis**
### Dataset Size: 1.2 Million Records


Name : Muhammad Ishfaq Khan \
Email : ishfaqkhan.dev@gmail.com \
GitHub : [ishfaqkhan-dev](https://github.com/ishfaqkhan-dev) \
LinkedIn : https://www.linkedin.com/in/muhammad-ishfaq-khan-666b27370 \
Kaggle : [ishfaqkhandev](https://www.kaggle.com/ishfaqkhandev) \
Twitter : [@ishfaqkhan_dev](https://x.com/ishfaqkhan_dev) 

## **Collaborator**
This analysis was done in collaboration with [@shumailakhan](https://www.kaggle.com/shumailakhan).



## **Introduction**

The Apple App Store hosts over a million iOS applications across diverse categories such as games, education, lifestyle, productivity, and more. Understanding how these apps perform, what factors affect their ratings, and which app types dominate the platform can offer valuable insights for developers, analysts, and business strategists.

In this notebook, I will explore the **Apple App Store dataset** (1.2 million+ apps) to:

- Analyze app ratings, reviews, and pricing trends.  
- Understand category-wise distribution and popularity.  
- Identify top-performing apps based on ratings and review counts.  
- Detect patterns in app sizes, in-app purchases, and developer presence.  
- Clean, preprocess, and visualize the data for meaningful interpretation.



## **Dataset Overview**

The dataset was collected in **October 2021** using a Python-based Scrapy script and made publicly available by **Gautham Prakash**.

🔗 [Apple App Store Dataset](https://www.kaggle.com/datasets/gauthamp10/apple-appstore-apps)

This dataset contains detailed information about more than 1.2 million iOS applications from the Apple App Store. Each row represents a unique app, and the data spans a wide range of fields that give us a deep look into the app ecosystem.

There are 21 columns in total, covering things like app IDs, developer details, pricing, user ratings, update history, and more. For example, we can find out when an app was released or last updated, how big the app is in bytes, what iOS version it requires, and whether it’s free or paid.

What makes this dataset even more interesting is that it includes both overall and current version ratings, along with review counts. This gives us a chance to compare how apps are performing over time. The presence of developer websites and profile links also allows us to look into developer-level activity and app publishing trends.

Overall, it’s a large and well-structured dataset that offers plenty of opportunities for meaningful analysis, especially if you’re curious about how apps are built, maintained, and received by users.


---

## **Getting Started**  
We begin by loading the dataset and importing the libraries we’ll use.

In [None]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings 
warnings.filterwarnings('ignore')


In [None]:
# load the dataset
df = pd.read_csv('../kaggle_datasets/appleAppData.csv')

## **Checking the Size of the Data**  
Let’s see how many rows and columns are in the dataset.

In [None]:
df.shape

## **Looking at Column Types**  
We check what kind of data each column has, like numbers or text.

In [None]:
df.info()

## **Viewing the First Few Rows**  
This helps us see how the data looks at the start.

In [None]:
df.head()

## **Converting Date Columns**

The `Released` and `Updated` columns contain dates, but they are stored as text. To use them in time-based analysis, we need to convert them into datetime format.


In [None]:
# Convert 'Released' and 'Updated' columns to datetime format
df['Released'] = pd.to_datetime(df['Released'], errors='coerce')
df['Updated'] = pd.to_datetime(df['Updated'], errors='coerce')
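
As a small illustration of what `errors='coerce'` does, the sketch below parses a hypothetical sample of date strings (the values are made up for illustration and are not taken from the dataset). Unparseable or missing entries become `NaT` instead of raising an error, so it can be worth counting them after the conversion.

```python
import pandas as pd

# Hypothetical sample values, made up for illustration only
sample_dates = pd.Series(['2017-09-28T03:02:41Z', 'not a date', None])

# errors='coerce' turns anything that cannot be parsed into NaT
parsed = pd.to_datetime(sample_dates, errors='coerce')

# Unparseable and missing entries both end up as NaT
n_unparsed = parsed.isna().sum()
print(parsed)
print("Values that could not be parsed:", n_unparsed)
```

Running the same `isna().sum()` check on `df['Released']` and `df['Updated']` would show how many rows were affected by the coercion.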


## **Exploring Required_IOS_Version Column**

Before cleaning the `Required_IOS_Version` column, let’s check what kind of values it contains. This will help us decide how to clean and convert it properly.


In [None]:
# Check unique values in the Required_IOS_Version column
df['Required_IOS_Version'].unique()


## **Cleaning iOS Version Requirements**

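The `Required_IOS_Version` column has many different version formats. Some are simple like `10.0`, `12.1`, or `11.24`, while others have a third component like `10.15.6`. To make this column easier to analyze, we'll extract just the first two parts (major and minor version) and convert them to float. One limitation of this approach is that a float cannot tell a minor version of `10` from a minor version of `1`: `10.10` and `10.1` both become `10.1`.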


In [None]:
# Extract major.minor version only (e.g., 13.2 from 13.2.1 or 13.2.99)
df['Required_IOS_Version'] = df['Required_IOS_Version'].str.extract(r'(\d+\.\d+)')

# Convert to float
df['Required_IOS_Version'] = pd.to_numeric(df['Required_IOS_Version'], errors='coerce')
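
The sketch below shows the extraction on a few hypothetical version strings (made up for illustration). It also shows one possible alternative, which is an assumption on my part and not what this notebook uses: keeping major and minor as separate integer columns, so that `10.10` and `10.1` stay distinct.

```python
import pandas as pd

# Hypothetical version strings, made up for illustration only
versions = pd.Series(['13.2.1', '10.10', '10.1', '8.0', None])

# Approach used in this notebook: keep "major.minor" and convert to float
as_float = pd.to_numeric(versions.str.extract(r'(\d+\.\d+)')[0], errors='coerce')

# Alternative sketch: keep major and minor as separate nullable integers
parts = versions.str.extract(r'(?P<major>\d+)\.(?P<minor>\d+)')
parts = parts.apply(pd.to_numeric).astype('Int64')

print(as_float.tolist())  # '10.10' and '10.1' collapse to the same float
print(parts)              # minor stays 10 vs 1
```

The float version is convenient for plotting and summary statistics; the two-column version is safer if exact version comparisons matter.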


## **Rechecking Data After Cleaning**

Now that we’ve fixed the date and version columns, let’s check if everything looks good. We’ll use `.info()` to review the data types and non-null counts. Missing values and a summary of the numeric columns are covered in the next section.


In [None]:
# Recheck data types and non-null counts
df.info()


## **Checking for Missing Values and Summary**

Before we move forward, it’s important to see if our dataset has any missing data. We’ll also look at a basic statistical summary to understand the overall distribution of the numeric columns.


In [None]:
# Check how many values are missing in each column
df.isnull().sum().sort_values(ascending=False)


In [None]:
# Get summary of all numeric columns
df.describe()


### **Handling Developer ID Column**

The `DeveloperId` column is numeric, but it acts as an identifier for each developer. It doesn’t carry analytical meaning like average or distribution. So we won’t use it in statistical summaries, but we’ll keep it in the dataset in case we need to group apps by developer later.

> **Note:** This column may still appear in numeric summaries (like `df.describe()`), but we will simply ignore it during analysis because it has no statistical value.
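
If the identifier gets in the way when reading the summary, it can be dropped just for that call. A minimal sketch on a hypothetical toy frame (values made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame, made up for illustration only
toy = pd.DataFrame({
    'DeveloperId': [1001, 1002, 1003],
    'Price': [0.0, 1.99, 4.99],
    'Reviews': [0, 12, 340],
})

# Summarize numeric columns while leaving the identifier out;
# drop() returns a new frame, so 'toy' itself keeps the column
summary = toy.drop(columns=['DeveloperId']).describe()
print(summary)
```

The column stays in the original frame, so grouping by developer later still works.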


## **Missing Values Analysis**

We identified several columns with missing values. Some of them are optional (like `Developer_Website` and `Developer_Url`) and can be ignored. Others like `Price`, `Size_Bytes`, and `Required_IOS_Version` are important and need to be handled.

We'll decide case-by-case whether to fill, drop, or leave them as they are.


### **Filling Missing Values**

To prepare our data for analysis, we filled missing values based on the type and importance of each column:

- `Required_IOS_Version`: Filled with the average iOS version required by apps.
- `Price`: If the app is marked as free, we set the price to 0. Otherwise, we used the average price of paid apps.
- `Size_Bytes`: Filled with the median value since size varies a lot and median is less affected by extreme values.


In [None]:
# Fill missing Required_IOS_Version with mean
# (assign the result back instead of using inplace=True on a column,
#  which is deprecated chained assignment in recent pandas versions)
mean_ios_version = df['Required_IOS_Version'].mean()
df['Required_IOS_Version'] = df['Required_IOS_Version'].fillna(mean_ios_version)

# Fill missing Price
# If Free == True, set Price = 0. Otherwise, use mean of paid apps
is_free = df['Free'].astype(bool)
mean_paid_price = df.loc[~is_free, 'Price'].mean()
# Vectorized replacement for the row-wise apply: much faster on 1.2M rows
df.loc[df['Price'].isna() & is_free, 'Price'] = 0
df['Price'] = df['Price'].fillna(mean_paid_price)

# Fill missing Size_Bytes with median
median_size = df['Size_Bytes'].median()
df['Size_Bytes'] = df['Size_Bytes'].fillna(median_size)
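
The choice of the median for `Size_Bytes` can be illustrated with a tiny worked example. The sizes below are hypothetical (in MB, made up for illustration): a single very large app pulls the mean far away from the typical value, while the median barely moves.

```python
import pandas as pd

# Hypothetical app sizes in MB: one very large app and one missing value
sizes = pd.Series([10.0, 30.0, 50.0, 1000.0, None])

mean_size = sizes.mean()      # pulled upward by the 1000 MB app
median_size = sizes.median()  # stays close to the typical app

# Fill the gap with the median, as done for Size_Bytes
filled = sizes.fillna(median_size)
print("mean:", mean_size, "median:", median_size)
print(filled.tolist())
```

Here the mean is 272.5 MB and the median is 40 MB, so filling with the mean would invent an app far larger than most of the others.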


### **Handling Remaining Missing Values**

To finalize the missing value treatment:

- We dropped the few rows where either `App_Name` or `Released` was missing since they are essential fields.
- For optional but useful fields like `Developer_Url` and `Developer_Website`, we filled the missing values with `"Not Provided"` to retain the columns for reference.


In [None]:
# Drop rows with missing App_Name or Released date
df.dropna(subset=['App_Name', 'Released'], inplace=True)

# Fill Developer_Url and Developer_Website with "Not Provided"
# (assign back instead of inplace=True on a single column)
df['Developer_Url'] = df['Developer_Url'].fillna("Not Provided")
df['Developer_Website'] = df['Developer_Website'].fillna("Not Provided")


### **Final Check for Missing Values**

After filling the missing values and dropping the incomplete rows, we’ll now recheck the dataset to confirm that there are no remaining missing values.


In [None]:
# Check again for any missing values
df.isnull().sum().sort_values(ascending=False)


## **Dataset Recheck Before Univariate Analysis**

Before we dive into univariate analysis, it's a good idea to double-check the current state of our dataset. We'll review:

- The overall shape (number of rows and columns).
- The data types of each column.
- Whether there are any remaining null values.

This quick check ensures our dataset is properly structured and ready for accurate visualizations and insights.


In [None]:
# Check the current shape of the dataset
print("Dataset Shape:", df.shape)

# Check data types and null values
df.info()


Now that missing values have been handled and data types corrected, we can move forward to explore the dataset visually and statistically. Outliers are dealt with column by column during the analysis below.

## **Univariate Analysis**

In this section, we will explore each individual column from the dataset to understand its distribution, value types, and patterns. Since we have only handled missing values and not dealt with outliers, the analysis reflects the natural state of the data.

We'll begin with the categorical columns and then move on to the numerical ones to observe how values are distributed, identify any unusual patterns, and summarize key statistics.



### **Primary Genre**

The `Primary_Genre` column represents the main category under which an app is listed in the App Store, such as Games, Productivity, Education, etc. Exploring this helps us understand which categories are most common and which are less frequent.


In [None]:
# Count of apps by genre
plt.figure(figsize=(12, 5))
df['Primary_Genre'].value_counts().head(20).plot(kind='bar')
plt.title('Top 20 Most Common App Genres')
plt.xlabel('Primary Genre')
plt.ylabel('Number of Apps')
plt.xticks(rotation=45)
plt.show()

# Unique genres
print("Number of unique genres:", df['Primary_Genre'].nunique())
print(df['Primary_Genre'].value_counts().head())


### **Content Rating**

The `Content_Rating` column shows the age group suitability of an app (like 4+, 9+, 12+, etc.). Analyzing this helps us identify which age group most apps target.


In [None]:
# Count of apps by content rating
plt.figure(figsize=(8, 4))
df['Content_Rating'].value_counts().plot(kind='bar', color='skyblue')
plt.title('App Count by Content Rating')
plt.xlabel('Content Rating')
plt.ylabel('Number of Apps')
plt.xticks(rotation=0)
plt.show()

# Unique content ratings
print("Content Rating Distribution:")
print(df['Content_Rating'].value_counts())


### **App Size (Bytes)**

The `Size_Bytes` column contains the size of the app in bytes. We will analyze how app sizes are distributed across the dataset, from small lightweight apps to very large ones.


In [None]:
# Convert bytes to megabytes for better readability
df['Size_MB'] = df['Size_Bytes'] / (1024 * 1024)

# Histogram of app sizes
plt.figure(figsize=(10, 4))
sns.histplot(df['Size_MB'], bins=100, kde=True, color='green')
plt.title('Distribution of App Sizes (in MB)')
plt.xlabel('Size (MB)')
plt.ylabel('Number of Apps')
plt.show()

# Basic stats
print(df['Size_MB'].describe())


### **Handling Unrealistic App Sizes**

Some apps in the dataset show unusually large file sizes, going well beyond the expected range for typical mobile applications. While most iOS apps are under 2 GB in size, a few entries in the dataset exceed 3 GB, which is likely due to data errors or scraping anomalies.

To maintain data quality and avoid skewing the analysis, we will cap the app size at a maximum of **3 GB (3 * 1024^3 bytes)**. This ensures that extremely large outliers do not distort visualizations or statistical summaries.


In [None]:
# Define 3 GB in bytes
max_app_size_bytes = 3 * 1024 * 1024 * 1024  # 3 GB = 3,221,225,472 bytes

# Cap sizes greater than 3 GB to the maximum valid size
df.loc[df['Size_Bytes'] > max_app_size_bytes, 'Size_Bytes'] = max_app_size_bytes
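
The same cap can also be written with `Series.clip`, and it can be useful to count how many rows are affected before overwriting them. A minimal sketch on hypothetical sizes (made up for illustration):

```python
import pandas as pd

max_size = 3 * 1024 ** 3  # 3 GB in bytes

# Hypothetical sizes in bytes, made up for illustration only
sizes = pd.Series([50_000_000, 2 * 1024 ** 3, 5 * 1024 ** 3])

# Count affected rows before capping so the change is documented
n_capped = (sizes > max_size).sum()

# clip(upper=...) caps larger values and leaves the rest untouched
capped = sizes.clip(upper=max_size)
print("Rows capped:", n_capped)
print(capped.tolist())
```

Reporting `n_capped` alongside the cap makes it clear how much of the data was changed.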


In [None]:
# Convert bytes to MB
df['Size_MB'] = df['Size_Bytes'] / (1024 * 1024)

# Plot histogram
plt.figure(figsize=(10, 5))
sns.histplot(df['Size_MB'], bins=100, kde=True, color='teal')
plt.title('Adjusted Distribution of App Sizes (in MB)')
plt.xlabel('App Size (MB)')
plt.ylabel('Number of Apps')
plt.grid(True)
plt.show()


In [None]:
df['Size_MB'].describe()

**Note:** Some app sizes were unrealistically high (e.g., above 50 GB). To keep the data clean and meaningful, I capped the sizes at 3 GB (3072 MB).


### **Required iOS Version**

This column shows the minimum iOS version required to run the app. By analyzing this, we can understand how modern or outdated most apps are in terms of compatibility.


In [None]:
# Number of apps for the 20 most common required iOS versions
plt.figure(figsize=(12, 5))
df['Required_IOS_Version'].value_counts().head(20).plot(kind='bar', color='orange')
plt.title('Top 20 Required iOS Versions')
plt.xlabel('iOS Version')
plt.ylabel('Number of Apps')
plt.xticks(rotation=45)
plt.show()

print("Unique iOS Versions:", df['Required_IOS_Version'].nunique())


### **Price**

The `Price` column represents how much users must pay to download the app. Most apps are free, but some have a price tag. We'll analyze the distribution of prices to understand how many apps are paid and how expensive they can get.


In [None]:
# Distribution of app prices
plt.figure(figsize=(10, 5))
sns.histplot(df['Price'], bins=50, kde=True)
plt.title('Distribution of App Prices')
plt.xlabel('Price ($)')
plt.ylabel('Number of Apps')
plt.show()

# Descriptive stats
print(df['Price'].describe())


### **Handling Outliers in Price**

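Some apps in the dataset have extremely high prices, going up to $999.99, which is not practical for regular app purchases. To make the analysis meaningful and avoid distortion in graphs, we clip the prices of paid apps using the interquartile range (IQR) rule: prices below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are set to those bounds. The quartiles are computed on paid apps only, so the large number of free apps does not pull the bounds toward zero.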


In [None]:
# Apply outlier handling only on paid apps
paid_mask = df['Free'] == False

# Calculate IQR only on paid apps
Q1 = df.loc[paid_mask, 'Price'].quantile(0.25)
Q3 = df.loc[paid_mask, 'Price'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Clip only the paid apps' prices
df.loc[paid_mask, 'Price'] = df.loc[paid_mask, 'Price'].clip(lower=lower_bound, upper=upper_bound)
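
To make the IQR rule in the cell above concrete, here is a worked example on five hypothetical paid-app prices (made up for illustration). With these values Q1 is 1.99 and Q3 is 4.99, so the IQR is 3.00 and the upper bound is 4.99 + 1.5 * 3.00 = 9.49.

```python
import pandas as pd

# Hypothetical paid-app prices, made up for illustration only
prices = pd.Series([0.99, 1.99, 2.99, 4.99, 999.99])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr  # negative here, so it never binds for prices
upper = q3 + 1.5 * iqr

# Values outside [lower, upper] are set to the nearest bound
clipped = prices.clip(lower=lower, upper=upper)
print("bounds:", round(lower, 2), round(upper, 2))
print(clipped.round(2).tolist())
```

Note that the lower bound is negative in this example, so in practice only the upper bound changes any prices.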



### **Price (Paid Apps)**
Paid apps often have a wide price range. We filtered the dataset to include only paid apps and analyzed the price distribution using boxplots and summary statistics.

In [None]:
# Boxplot for paid apps
plt.figure(figsize=(10, 4))
sns.boxplot(x=df[df['Free'] == False]['Price'], color='orange')
plt.title('Price Distribution - Paid Apps')
plt.xlabel('Price ($)')
plt.show()

# Summary stats for paid apps
df[df['Free'] == False]['Price'].describe()


### **Free Column**
This column indicates whether an app is free or paid. It is a boolean field, useful for categorizing apps for further analysis.

In [None]:
# Count of Free vs Paid
plt.figure(figsize=(5, 3))
sns.countplot(x='Free', data=df, palette='Set2')
plt.title('Distribution of Free vs Paid Apps')
plt.xlabel('Free App (True = Free, False = Paid)')
plt.ylabel('App Count')
plt.show()

# Value counts
df['Free'].value_counts()


### **Average User Rating**

The `Average_User_Rating` column shows the average rating given by users on a scale from 0 to 5. We'll analyze its distribution to understand overall user satisfaction.


In [None]:
# Histogram of user ratings
plt.figure(figsize=(10, 5))
sns.histplot(df['Average_User_Rating'], bins=10, kde=True)
plt.title('Distribution of Average User Ratings')
plt.xlabel('Average Rating')
plt.ylabel('Number of Apps')
plt.show()

# Descriptive stats
print(df['Average_User_Rating'].describe())


### **Check how many apps have 0 rating**

In [None]:
(df['Average_User_Rating'] == 0).sum()


### **Compare zero-rated apps with total**

In [None]:
(df['Average_User_Rating'] == 0).mean() * 100  # percentage


### **Observations: Average User Rating**

More than 55% of the apps have an average user rating of 0.0. This likely indicates that these apps have not been rated by users yet. For analysis, we may either exclude these apps from certain rating-based insights or treat them separately to avoid skewing the results.


In [None]:
df['Average_User_Rating'].isnull().sum()  # check for NaN values

### **Average User Rating (Non-Zero Only)**

Since over half of the apps have a user rating of 0.0, we will analyze only those apps that have received at least one rating. This helps us focus on genuine user feedback and avoid skewing the overall rating distribution.


In [None]:
# Filter out zero ratings
rated_apps = df[df['Average_User_Rating'] > 0]

# Summary statistics
print(rated_apps['Average_User_Rating'].describe())

# Visualize with histogram
plt.figure(figsize=(8, 4))
sns.histplot(rated_apps['Average_User_Rating'], bins=20, kde=True, color='skyblue')
plt.title('Distribution of Average User Ratings (Non-Zero Only)')
plt.xlabel('Average User Rating')
plt.ylabel('Number of Apps')
plt.show()
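
An alternative to filtering, sketched here as an assumption and not something this notebook does, is to treat a rating of 0 as missing. pandas skips `NaN` in `mean()` and similar statistics, so unrated apps stop dragging the average down. The ratings below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: two unrated apps (0.0) and two rated apps
ratings = pd.Series([0.0, 0.0, 4.5, 3.5])

mean_with_zeros = ratings.mean()       # zeros counted as real ratings

# Treat 0 as "no rating yet" so it is ignored by mean(), median(), etc.
ratings_nan = ratings.replace(0, np.nan)
mean_rated_only = ratings_nan.mean()

print("mean including zeros:", mean_with_zeros)
print("mean of rated apps only:", mean_rated_only)
```

In this toy case the average moves from 2.0 to 4.0, which shows how strongly a large share of zero ratings can distort rating summaries.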


### **Reviews**

The `Reviews` column tells how many users reviewed the app. Let's explore the spread of review counts to see which apps are getting more attention.


In [None]:
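# Histogram of reviews (log scale for better visualization)
# Zero cannot be placed on a log axis, so only apps with at least one review are plotted
plt.figure(figsize=(10, 5))
sns.histplot(df.loc[df['Reviews'] > 0, 'Reviews'], bins=50, kde=True, log_scale=(True, False))
plt.title('Distribution of Review Counts (Log Scale, Apps with at Least One Review)')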
plt.xlabel('Number of Reviews')
plt.ylabel('Number of Apps')
plt.show()

# Descriptive stats
print(df['Reviews'].describe())


### **Reviews: Log Transformation**

The `Reviews` column represents the number of user reviews each app has received. Upon inspecting the distribution, we observed that the majority of apps have very few or no reviews at all. In fact, both the 25th percentile and the median are `0`, indicating that over 50% of apps have not received any user reviews.

To handle this heavily skewed distribution and improve visualization, we applied a log transformation using `log1p` (log(1 + Reviews)). This method retains zero values while scaling down extremely large numbers, making the data more suitable for further analysis and plotting.


In [None]:
# Create a log-transformed version of the Reviews column
df['Log_Reviews'] = np.log1p(df['Reviews'])

# Visualize the transformed distribution
plt.figure(figsize=(10, 5))
sns.histplot(df['Log_Reviews'], bins=50, kde=True, color='skyblue')
plt.title('Distribution of Log-Transformed Reviews')
plt.xlabel('Log(1 + Reviews)')
plt.ylabel('Number of Apps')
plt.show()

# Check summary of new column
print(df['Log_Reviews'].describe())


### **Log-Transformed Reviews**

After applying log transformation to the `Reviews` column, we observed that the distribution is still skewed, but now more interpretable. 

- At least half of the apps have zero reviews (25th and 50th percentiles are 0).
- The 75th percentile is approximately 1.39, which is log(4), i.e. log(1 + 3), so it corresponds to 3 reviews.
- The maximum is around 16.93, which corresponds to roughly 22 to 23 million reviews.

This transformation helps compress the extreme values, making it easier to analyze apps with varying review counts while preserving the zero-review entries.
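
Values on the `log1p` scale can be mapped back to review counts with `np.expm1`, which is the inverse of `np.log1p`. The short check below uses only the two figures quoted in the list above.

```python
import numpy as np

# log1p(3) = log(1 + 3) = log(4): the 75th percentile on the log scale
p75_log = np.log1p(3)

# expm1 inverts log1p, taking a log-scale value back to a review count
approx_max_reviews = np.expm1(16.93)

print(round(p75_log, 3))
print(f"{approx_max_reviews:,.0f}")
```

Because the transform adds 1 before taking the log, a value of log(4) on the transformed scale means 3 reviews, not 4.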


### **Current Version Score**

This column shows the average rating of the app’s latest version, rated by users. The values range from 0 to 5.

We will check its distribution and basic statistics to understand how most app versions are rated.


In [None]:
# Histogram of Current Version Score
plt.figure(figsize=(8, 4))
sns.histplot(df['Current_Version_Score'], bins=20, kde=True)
plt.title('Distribution of Current Version Score')
plt.xlabel('Current_Version_Score')
plt.ylabel('Number of Apps')
plt.show()

# Descriptive stats
print(df['Current_Version_Score'].describe())


### **Current Version Reviews**

This column shows how many users reviewed the current version of each app. The range is wide and might include extreme outliers. Many apps have zero reviews.

We will apply a log transformation to make the review counts easier to analyze.


In [None]:
# Log-transform for skewed review count
df['Log_Current_Version_Reviews'] = np.log1p(df['Current_Version_Reviews'])

# Histogram of log-transformed reviews
plt.figure(figsize=(8, 4))
sns.histplot(df['Log_Current_Version_Reviews'], bins=50, kde=True)
plt.title('Log Distribution of Current Version Reviews')
plt.xlabel('Log(1 + Current_Version_Reviews)')
plt.ylabel('Number of Apps')
plt.show()

# Descriptive stats
print(df['Log_Current_Version_Reviews'].describe())


## **Normality Check of Numerical Features**

Before applying advanced statistical or machine learning techniques, it's important to understand the distribution of numerical variables. Here, we assess the normality of selected features like `Price`, `Size_Bytes`, and `Reviews`.

For each feature:
- A histogram is plotted to visualize the distribution.
- A normal distribution curve (based on mean and standard deviation) is overlaid.
- This helps identify skewness, extreme peaks, or heavy tails in the data.

Large deviations from normality may suggest the need for transformations (like log-scaling) or robust models.


In [None]:
# import scipy for statistical functions
from scipy import stats

num_cols = ['Price', 'Size_Bytes', 'Reviews']  # start with the first three columns only
for col in num_cols:
    plt.figure(figsize=(6, 4))
    sns.histplot(df[col], bins=50, kde=True, stat="density")
    mean = df[col].mean()
    std = df[col].std()
    xmin, xmax = plt.xlim()
    x = np.linspace(xmin, xmax, 100)
    p = stats.norm.pdf(x, mean, std)
    plt.plot(x, p, 'k', linewidth=2)
    plt.title(f'Normality Check: {col}')
    plt.xlabel(col)
    plt.ylabel('Density')
    plt.show()

Below we visualize:
- `Current_Version_Score`
- `Current_Version_Reviews`
- `Average_User_Rating`

In [None]:

# List of columns to check
columns = ['Current_Version_Score', 'Current_Version_Reviews', 'Average_User_Rating']

# Plot normality check for each
for col in columns:
    plt.figure(figsize=(8, 4))
    sns.histplot(df[col], bins=50, kde=True, stat="density", color='skyblue', edgecolor='black')

    # Overlay normal distribution
    mean = df[col].mean()
    std = df[col].std()
    xmin, xmax = plt.xlim()
    x = np.linspace(xmin, xmax, 100)
    p = stats.norm.pdf(x, mean, std)
    plt.plot(x, p, 'r', linewidth=2)

    plt.title(f'Normality Check: {col}')
    plt.xlabel(col)
    plt.ylabel('Density')
    plt.grid(True)
    plt.tight_layout()
    plt.show()
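
The visual check can be complemented with a number. The sketch below is an addition of mine, not part of the original analysis: it uses pandas' `Series.skew()` on hypothetical review counts (made up for illustration) to show how `log1p` reduces right skew. A skewness near 0 indicates a roughly symmetric distribution.

```python
import numpy as np
import pandas as pd

# Hypothetical, heavily right-skewed review counts
reviews = pd.Series([0, 0, 0, 1, 2, 5, 10, 100, 10000])

skew_raw = reviews.skew()            # dominated by the single huge value
skew_log = np.log1p(reviews).skew()  # tail compressed by the transform

print("skewness before log1p:", round(skew_raw, 2))
print("skewness after log1p: ", round(skew_log, 2))
```

Applying the same call to columns such as `df['Reviews']` or `df['Price']` would give a quick numeric summary of what the histograms show.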

## **Bivariate Analysis**

In this section, we analyze the relationship between two variables to discover possible patterns, trends, or correlations. This helps in understanding how one feature might influence or relate to another, especially between numerical and categorical variables like Price, Reviews, Ratings, etc.


### **Price vs. Average User Rating**
This analysis helps us understand whether paid apps tend to get higher or lower ratings compared to free ones.

In [None]:
plt.figure(figsize=(10, 5))
sns.scatterplot(data=df, x='Price', y='Average_User_Rating', alpha=0.3)
plt.title('Price vs Average User Rating')
plt.xlabel('Price ($)')
plt.ylabel('Average User Rating')
plt.show()


### **Price vs Reviews**
We examine whether higher-priced apps tend to receive more or fewer user reviews.

In [None]:
plt.figure(figsize=(10, 5))
sns.scatterplot(data=df, x='Price', y='Reviews', alpha=0.3)
plt.title('Price vs Number of Reviews')
plt.xlabel('Price ($)')
plt.ylabel('Reviews')
plt.yscale('log')
plt.show()


### **Price vs Current_Version_Score**
This helps analyze if pricing has any relationship with how well users rate the current version of the app.

In [None]:
plt.figure(figsize=(10, 5))
sns.scatterplot(data=df, x='Price', y='Current_Version_Score', alpha=0.3)
plt.title('Price vs Current Version Score')
plt.xlabel('Price ($)')
plt.ylabel('Current Version Score')
plt.show()


### **Size_Bytes vs Average_User_Rating**
We check if app size has any relationship with user ratings.

In [None]:
plt.figure(figsize=(10, 5))
sns.scatterplot(data=df, x='Size_Bytes', y='Average_User_Rating', alpha=0.3)
plt.title('App Size vs Average User Rating')
plt.xlabel('Size (in Bytes)')
plt.ylabel('Average User Rating')
plt.xscale('log')
plt.show()


### **Reviews vs Average_User_Rating**
Helps us understand if popular apps (with more reviews) are rated higher or not.

In [None]:
plt.figure(figsize=(10, 5))
sns.scatterplot(data=df, x='Reviews', y='Average_User_Rating', alpha=0.3)
plt.title('Reviews vs Average User Rating')
plt.xlabel('Reviews')
plt.ylabel('Average User Rating')
plt.xscale('log')
plt.show()


### **Current_Version_Score vs Current_Version_Reviews**
We check whether current versions with more reviews tend to get better scores.

In [None]:
plt.figure(figsize=(10, 5))
sns.scatterplot(data=df, x='Current_Version_Reviews', y='Current_Version_Score', alpha=0.3)
plt.title('Current Version Reviews vs Score')
plt.xlabel('Current Version Reviews')
plt.ylabel('Current Version Score')
plt.xscale('log')
plt.show()


### **Reviews vs Current_Version_Reviews**
This helps see if current version reviews are proportional to total reviews.

In [None]:
plt.figure(figsize=(10, 5))
sns.scatterplot(data=df, x='Reviews', y='Current_Version_Reviews', alpha=0.3)
plt.title('Total Reviews vs Current Version Reviews')
plt.xlabel('Total Reviews')
plt.ylabel('Current Version Reviews')
plt.xscale('log')
plt.yscale('log')
plt.show()


### **Average_User_Rating vs Current_Version_Score**
Helps assess consistency between overall and current version ratings.

In [None]:
plt.figure(figsize=(10, 5))
sns.scatterplot(data=df, x='Average_User_Rating', y='Current_Version_Score', alpha=0.3)
plt.title('Overall vs Current Version Rating')
plt.xlabel('Average User Rating')
plt.ylabel('Current Version Score')
plt.show()


## **Multivariate Analysis: Ratings by Genre and App Type**
In this chart, we compare the average user rating for each genre based on whether the app is free or paid. This helps identify how pricing might influence user satisfaction across different categories.

In [None]:
plt.figure(figsize=(14, 6))
sns.barplot(data=df, x='Primary_Genre', y='Average_User_Rating', hue='Free', ci=None)
plt.xticks(rotation=45, ha='right')
plt.title('Average User Rating by Genre and App Type (Free vs Paid)')
plt.xlabel('App Genre')
plt.ylabel('Average User Rating')
plt.legend(title='Free App')
plt.tight_layout()
plt.show()


## **Multivariate Analysis: Price and Reviews vs User Ratings**
This scatter plot shows how app price and number of reviews relate to user ratings. It can reveal if paid apps or popular apps get higher ratings.

In [None]:
plt.figure(figsize=(12, 6))
sns.scatterplot(data=df[df['Price'] > 0], x='Price', y='Average_User_Rating',
                size='Reviews', hue='Primary_Genre', alpha=0.6, legend=False)
plt.title('Paid Apps: Price vs User Rating (Bubble Size = Reviews)')
plt.xlabel('Price ($)')
plt.ylabel('Average User Rating')
plt.tight_layout()
plt.show()


## **Top 10 Most Reviewed Apps (Unique)**
We sort the apps by review count and keep only the first occurrence of each app name, so each app appears once with its highest number of reviews.



In [None]:
# Drop duplicate app names keeping the one with highest reviews
top_reviewed = df.sort_values('Reviews', ascending=False).drop_duplicates(subset='App_Name')

# Select top 10 by reviews
top_10_reviewed = top_reviewed[['App_Name', 'Reviews', 'Average_User_Rating']].head(10)
print("Top 10 Apps by Reviews:")
print(top_10_reviewed)


## **Top 10 Highest Rated Apps (Unique)**
We now select top-rated apps (minimum 50 reviews to avoid noise) and ensure no duplicates.

In [None]:
# Drop duplicate app names keeping the highest rated
top_rated = df[df['Reviews'] >= 50].sort_values('Average_User_Rating', ascending=False).drop_duplicates(subset='App_Name')

# Select top 10 by rating
top_10_rated = top_rated[['App_Name', 'Average_User_Rating', 'Reviews']].head(10)
print("Top 10 Apps by Average Rating (min 50 reviews):")
print(top_10_rated)
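
When many apps share the same top rating, sorting by rating alone leaves their order arbitrary. One option, shown here as a sketch on hypothetical data and not as the method used above, is to break ties by review count.

```python
import pandas as pd

# Hypothetical apps, made up for illustration only
toy = pd.DataFrame({
    'App_Name': ['A', 'B', 'C', 'D'],
    'Average_User_Rating': [5.0, 5.0, 4.9, 5.0],
    'Reviews': [60, 12000, 900000, 30],
})

# Keep apps with at least 50 reviews, then sort by rating and
# use the review count as a tie-breaker
ranked = (
    toy[toy['Reviews'] >= 50]
    .sort_values(['Average_User_Rating', 'Reviews'], ascending=[False, False])
    .drop_duplicates(subset='App_Name')
)
print(ranked)
```

App `D` is excluded by the review threshold, and among the two 5.0 apps the one with more reviews is listed first.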


## **Top Rated App in Each Category**
Here we sort by rating and keep the first app in each category, which gives the highest rated app per category (again requiring at least 50 reviews).

In [None]:
# Drop duplicates to get highest rated app per category
top_category_apps = df[df['Reviews'] >= 50].sort_values('Average_User_Rating', ascending=False).drop_duplicates(subset='Primary_Genre')

# Show relevant columns
category_top = top_category_apps[['Primary_Genre', 'App_Name', 'Average_User_Rating', 'Reviews']].sort_values(by='Average_User_Rating', ascending=False)
print("Top Rated Apps by Category:")
print(category_top.head(10))  # Show top 10 categories
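
The same "best app per category" result can also be obtained with an explicit group-by. The sketch below uses `groupby(...).idxmax()` on hypothetical data (made up for illustration); it is an equivalent formulation, not the code this notebook runs.

```python
import pandas as pd

# Hypothetical apps, made up for illustration only
toy = pd.DataFrame({
    'Primary_Genre': ['Games', 'Games', 'Education', 'Education'],
    'App_Name': ['G1', 'G2', 'E1', 'E2'],
    'Average_User_Rating': [4.2, 4.8, 4.9, 3.0],
})

# idxmax returns, per genre, the row label of the highest rating
best_idx = toy.groupby('Primary_Genre')['Average_User_Rating'].idxmax()
best_per_genre = toy.loc[best_idx]
print(best_per_genre)
```

Like the sort-and-deduplicate approach, this returns one row per genre; when ratings tie, `idxmax` picks the first matching row.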


## **Correlation Between Numerical Features**
We analyze how numeric columns like Ratings, Reviews, Size, and Price are related to each other.

In [None]:
# Correlation matrix
plt.figure(figsize=(10, 6))
numeric_cols = ['Price', 'Size_Bytes', 'Average_User_Rating', 'Reviews', 'Current_Version_Score', 'Current_Version_Reviews']
correlation_matrix = df[numeric_cols].corr()

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Numeric Features')
plt.show()


## **Correlation Matrix Analysis**

The correlation matrix above highlights the relationships between key numeric features in the dataset. Here are the main observations:

- Most feature pairs show **very weak correlation**, meaning there is little linear relationship between them.
- `Size_Bytes` has a **slight positive correlation with Price** (0.13), meaning larger apps may cost more.
- `Average_User_Rating` is **not strongly correlated** with Reviews or Price, suggesting that user ratings do not simply track popularity or cost.
- `Reviews`, `Current_Version_Reviews`, and `Current_Version_Score` show **noticeable alignment**, which makes sense since they all relate to user feedback.

This indicates that each variable contributes uniquely and should be studied individually as well as in combination for deeper analysis.
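
`DataFrame.corr()` computes Pearson correlation by default, which only measures linear association. For heavily skewed columns such as review counts, a rank-based measure can tell a different story. The sketch below, an addition of mine on hypothetical data, compares the two using `method='spearman'`.

```python
import pandas as pd

# Hypothetical data: y always increases with x, but not linearly
toy = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [1, 10, 100, 1000, 10000],
})

pearson = toy.corr(method='pearson').loc['x', 'y']
spearman = toy.corr(method='spearman').loc['x', 'y']

print("Pearson: ", round(pearson, 2))
print("Spearman:", round(spearman, 2))
```

Spearman is 1.0 here because the ordering is perfectly consistent, while Pearson is noticeably lower. A weak Pearson value in the heatmap therefore does not rule out a monotonic relationship.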


## **Distribution of User Ratings**
This plot shows how user ratings are spread across the dataset.

In [None]:
plt.figure(figsize=(8, 4))
sns.histplot(df['Average_User_Rating'], bins=20, kde=True, color='skyblue')
plt.title('Distribution of Average User Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()


## **Price Distribution of Paid Apps Only**
This helps understand how paid app prices are spread.

In [None]:
plt.figure(figsize=(8, 4))
sns.histplot(df[df['Free'] == False]['Price'], bins=30, color='orange', kde=True)
plt.title('Price Distribution of Paid Apps')
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.show()


## **Top Categories by App Count**
Which genres have the most apps?

In [None]:
top_categories = df['Primary_Genre'].value_counts().head(10)

plt.figure(figsize=(10, 5))
sns.barplot(x=top_categories.index, y=top_categories.values, palette='viridis')
plt.title('Top 10 App Categories by Number of Apps')
plt.ylabel('Number of Apps')
plt.xlabel('Category')
plt.xticks(rotation=45)
plt.show()


## **Average Rating by Category (Bar Plot)**

In [None]:
avg_rating_by_genre = df.groupby('Primary_Genre')['Average_User_Rating'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 5))
sns.barplot(x=avg_rating_by_genre.index, y=avg_rating_by_genre.values, palette='magma')
plt.title('Top 10 Categories by Average User Rating')
plt.ylabel('Average Rating')
plt.xlabel('Category')
plt.xticks(rotation=45)
plt.show()
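
Since a rating of 0.0 most likely means "not yet rated", category averages that include those rows are pulled down by the share of unrated apps in each genre. The sketch below, on hypothetical data and offered as a possible variant and not the notebook's method, averages rated apps only and reports how many apps each mean is based on.

```python
import pandas as pd

# Hypothetical apps, made up for illustration only
toy = pd.DataFrame({
    'Primary_Genre': ['Games', 'Games', 'Games', 'Books', 'Books'],
    'Average_User_Rating': [0.0, 4.0, 5.0, 0.0, 4.0],
})

# Keep rated apps only, then compute mean and sample size per genre
rated = toy[toy['Average_User_Rating'] > 0]
by_genre = (
    rated.groupby('Primary_Genre')['Average_User_Rating']
    .agg(['mean', 'count'])
    .sort_values('mean', ascending=False)
)
print(by_genre)
```

Showing the count next to the mean helps avoid over-reading averages that rest on only a handful of rated apps.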


## **Insights from Ratings and Review Trends**

### **High Ratings with Low Reviews**
Some apps received a perfect **5.0 rating** but had only **50 to 100 reviews**. This indicates early-stage or low-traffic apps that performed well with a small audience.

### **Stable Ratings with High Reviews**
Apps with **millions of reviews** generally had ratings between **4.6 and 4.9**, suggesting consistent user satisfaction over time.

### **Dominant Categories Among Top Apps**
The most common categories among top-rated or most-reviewed apps were **Games**, **Entertainment**, and **Social Networking**. However, **Education**, **Business**, and **Shopping** also appeared, showing wide interest across different app types.

### **Review Counts and Free Apps**
All of the top 10 most-reviewed apps were **free**. Free apps likely reach more users, which would naturally result in a higher number of reviews, though the dataset has no download counts to confirm this.

### **Price Impact on Ratings**
There was no strong correlation between **price and user ratings**. Most highly rated apps were either **free** or very affordable (like $0.99). This suggests users might have higher expectations from paid apps and rate them more critically.


## **Final Thoughts and Summary**

This analysis gave us valuable insights into the Apple App Store dataset, highlighting patterns across app ratings, review counts, pricing, and categories. We observed that free apps tend to attract more reviews, while user ratings are a useful signal of app quality once unrated apps are set aside. Pricing had little correlation with ratings, but every app in the top 10 by review count was free, which hints that price affects reach; the dataset has no download counts, so this cannot be confirmed directly. By combining univariate, bivariate, and multivariate techniques, we now have a strong foundation for predictive modeling, business strategy, or deeper machine learning tasks going forward.
