In [None]:
# Assessing the missing values in the dataset
missing_values = ods_data.isnull().sum()
percent_missing = (missing_values / len(ods_data)) * 100

# Creating a dataframe to display missing value count and percentage
missing_value_df = pd.DataFrame({
    'Column': missing_values.index,
    'Missing Values': missing_values.values,
    'Percentage': percent_missing.values
})

# Displaying columns with missing values
missing_value_df[missing_value_df['Missing Values'] > 0].sort_values(by='Percentage', ascending=False)
# Dropping columns with 100% missing values and columns with very high percentage of missing values (>90%)
columns_to_drop = missing_value_df[missing_value_df['Percentage'] > 90]['Column']
cleaned_data = ods_data.drop(columns=columns_to_drop)

# Converting date columns to datetime format and ensuring numerical columns are of numeric types
# First, we need to identify the date columns and numeric columns
date_columns = cleaned_data.select_dtypes(include=['datetime']).columns
numeric_columns = cleaned_data.select_dtypes(include=['int64', 'float64']).columns

# No explicit conversion is required if the columns are already in the correct format
# However, if there are specific columns that are not correctly identified, we can convert them manually

# Checking the data types after conversion
cleaned_data[date_columns].dtypes, cleaned_data[numeric_columns].dtypes

# For now, let's proceed to the next step, as it seems columns are in the correct format.
To perform a comprehensive data analysis, we will follow these steps:

1. Assess Missing Values
2. Data Type Conversion
3. Data Consistency Checks
4. In-Depth Analysis and Visualization

Let's begin with the first step.

### 1. Assess Missing Values

We'l start by examining the extent of missing values in each column. This will help us decide how to handle these missing values appropriately.

The analysis of missing values reveals that some columns have a high percentage of missing data. Here's how we can handle them:

1. Columns with 100% Missing Values: These can be removed as they provide no information.
2. Columns with a High Percentage of Missing Values (>90%): These columns are mostly missing and may not be useful for analysis unless they contain critical information.
3. Other Columns with Missing Values: Depending on the importance of these columns and the nature of the missing data, we can consider imputing missing values or using them as is for specific analyses.

### 2. Data Type Conversion

Next, we'll ensure that each column is of the appropriate data type for analysis. This includes converting date columns to datetime format and ensuring numerical columns are of numeric types.

### 3. Data Consistency Checks

We'll also perform checks for data consistency, like looking for outliers or incorrect entries.

After addressing these steps, we can move on to in-depth analysis and visualization. Let's start with cleaning and data type conversion.

The columns are now in appropriate data types:

- Date columns are in datetime64 format.
- Numeric columns are in either int64 or float64 format.

### 4. In-Depth Analysis and Visualization

Now, we'll move on to the in-depth analysis and visualization part. Given the nature of the data, we can explore several aspects such as:

1. School Distribution by City and Area
2. Analysis of School Features (like establishment year, grades offered, etc.)
3. Geographical Distribution (using latitude and longitude)
4. Temporal Trends (if applicable, e.g., in the creation of schools over time)

Please let me know if you have any specific questions or areas of focus for this analysis, or if you'd like me to proceed with these suggested analyses.
import matplotlib.pyplot as plt
import seaborn as sns

# Setting the aesthetic style of the plots
sns.set(style="whitegrid")

# Analysis of School Distribution by City
city_distribution = cleaned_data['city'].value_counts()

# Plotting the distribution of schools by city
plt.figure(figsize=(10, 6))
city_distribution_plot = sns.barplot(x=city_distribution.values, y=city_distribution.index, palette="viridis")
plt.title('Distribution of Schools by City')
plt.xlabel('Number of Schools')
plt.ylabel('City')
plt.show()

# Similarly, we can analyze the distribution by area, but given the large number of unique areas, we might limit to the top areas
top_areas = cleaned_data['area'].value_counts().head(20)  # Top 20 areas

# Plotting the distribution of schools by area
plt.figure(figsize=(10, 6))
area_distribution_plot = sns.barplot(x=top_areas.values, y=top_areas.index, palette="mako")
plt.title('Top 20 Areas by Number of Schools')
plt.xlabel('Number of Schools')
plt.ylabel('Area')
plt.show()
import matplotlib.pyplot as plt
import seaborn as sns

# Setting the aesthetic style of the plots
sns.set(style="whitegrid")

# Analysis of School Distribution by City
city_distribution = cleaned_data['city'].value_counts()

# Plotting the distribution of schools by city
plt.figure(figsize=(10, 6))
city_distribution_plot = sns.barplot(x=city_distribution.values, y=city_distribution.index, palette="viridis")
plt.title('Distribution of Schools by City')
plt.xlabel('Number of Schools')
plt.ylabel('City')
plt.show()

# Similarly, we can analyze the distribution by area, but given the large number of unique areas, we might limit to the top areas
top_areas = cleaned_data['area'].value_counts().head(20)  # Top 20 areas

# Plotting the distribution of schools by area
plt.figure(figsize=(10, 6))
area_distribution_plot = sns.barplot(x=top_areas.values, y=top_areas.index, palette="mako")
plt.title('Top 20 Areas by Number of Schools')
plt.xlabel('Number of Schools')
plt.ylabel('Area')
plt.show()
# Analysis of School Features - Establishment Year
# Filtering out invalid or missing establishment years
valid_est_years = cleaned_data[cleaned_data['establishment_year'].notnull() & 
                               (cleaned_data['establishment_year'] > 0)]

# Plotting the distribution of establishment years
plt.figure(figsize=(10, 6))
sns.histplot(valid_est_years['establishment_year'], kde=False, bins=30, color='skyblue')
plt.title('Distribution of Schools by Establishment Year')
plt.xlabel('Establishment Year')
plt.ylabel('Number of Schools')
plt.show()
# Geographical Distribution of Schools using Latitude and Longitude
# Filtering out invalid or missing coordinates
valid_coordinates = cleaned_data[(cleaned_data['latitude'].notnull()) & 
                                 (cleaned_data['longitude'].notnull())]

plt.figure(figsize=(10, 6))
sns.scatterplot(x=valid_coordinates['longitude'], y=valid_coordinates['latitude'], alpha=0.6)
plt.title('Geographical Distribution of Schools')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
# Temporal Trends - Creation of Schools Over Time
# Extracting year from 'created_at' column
cleaned_data['created_year'] = cleaned_data['created_at'].dt.year

# Filtering out invalid or missing years
valid_created_years = cleaned_data[cleaned_data['created_year'].notnull() & 
                                   (cleaned_data['created_year'] > 0)]

# Counting the number of schools created each year
yearly_creation = valid_created_years['created_year'].value_counts().sort_index()

# Plotting the trend of school creation over the years
plt.figure(figsize=(10, 6))
yearly_creation_plot = sns.lineplot(x=yearly_creation.index, y=yearly_creation.values, marker='o', color='purple')
plt.title('Trend of School Creation Over Years')
plt.xlabel('Year')
plt.ylabel('Number of Schools Created')
plt.xticks(rotation=45)
plt.show()
Let's proceed with the suggested analyses:

1. School Distribution by City and Area
2. Analysis of School Features (like establishment year, grades offered, etc.)
3. Geographical Distribution (using latitude and longitude)
4. Temporal Trends (e.g., creation of schools over time)

### 1. School Distribution by City and Area

We'll start by examining the distribution of schools across different cities and areas. This will give us an insight into the geographical spread of the schools in the dataset.

The visualizations provide insights into the distribution of schools:

- Distribution by City: Shows the number of schools in each city. Some cities have a higher concentration of schools than others.
- Top 20 Areas by Number of Schools: Highlights the areas with the most schools. This can be useful for understanding which areas are educational hubs.

### 2. Analysis of School Features

Next, we'll analyze various school features such as establishment year, grades offered, etc. For this part, we'll focus on the 'establishment_year' and 'grades' columns. We can also explore other features based on your interest. Let's start with the distribution of schools by their establishment year.

The histogram shows the distribution of schools by their establishment year. This visualization helps understand trends in the establishment of schools over time, such as periods with higher or lower numbers of new schools.

### 3. Geographical Distribution (Using Latitude and Longitude)

Now, let's look at the geographical distribution of schools using latitude and longitude data. This will be a scatter plot to visualize the spread of schools geographically.

The scatter plot illustrates the geographical distribution of schools based on their latitude and longitude coordinates. This visualization is useful for identifying clusters of schools and their spread across different regions.

### 4. Temporal Trends (Creation of Schools Over Time)

Finally, let's explore any temporal trends, such as how the creation of schools has varied over time. We'll use the 'created_at' column for this analysis, examining the trend in the number of new school entries in the dataset over the years.

The line plot shows the trend of school creation over the years, based on the 'created_at' column. This trend analysis helps in understanding the growth or decline in the number of schools being added to the dataset over time.

### Summary of In-Depth Analysis

1. School Distribution: We observed the distribution of schools across various cities and top areas.
2. School Features: The establishment years of schools were analyzed to see trends in educational infrastructure development.
3. Geographical Distribution: A scatter plot revealed the geographical spread of schools.
4. Temporal Trends: We examined how the number of new schools recorded in the dataset varied over the years.

If there are any specific aspects of the data you'd like to explore further or any other analyses you have in mind, please let me know!
