<h1>Folder Placements</h1>
<ul>
  <li><strong>Weather_Data_Scraping_and_Analysis</strong>
    <ul>
      <li><strong>weather.csv:</strong> Contains the main weather data, with columns including <code>City</code>, <code>Lat</code> (latitude), and <code>Lng</code> (longitude).</li>
      <li><strong>Dataset:</strong> Folder containing one CSV file of weather data per city.</li>
      <li><strong>cities_not_scraped.csv:</strong> CSV file containing the list of cities from weather.csv that were not scraped successfully.</li>
      <li><strong>Weather_Data_Scraping_and_Analysis.ipynb:</strong> Jupyter notebook file for data scraping and analysis.</li>
    </ul>
  </li>
</ul>


<h1>Import Libraries</h1>
<ul>
  <li><strong>csv:</strong> To read and write CSV files.</li>
  <li><strong>os:</strong> To interact with the operating system, e.g., file paths.</li>
  <li><strong>time:</strong> To introduce delays.</li>
  <li><strong>requests_cache:</strong> For caching HTTP requests.</li>
  <li><strong>retry:</strong> Function imported from <code>retry_requests</code> that wraps a session so failed HTTP requests are retried.</li>
  <li><strong>pandas:</strong> For data manipulation and analysis.</li>
  <li><strong>openmeteo_requests:</strong> Client library for interacting with the Open-Meteo API.</li>
  <li><strong>retry_requests:</strong> Library that provides the <code>retry</code> function used for retrying failed HTTP requests.</li>
</ul>


In [None]:
# Cell 1: Import necessary libraries
!pip install openmeteo_requests
!pip install retry_requests
!pip install requests_cache

import csv
import os
import time
import requests_cache
from retry_requests import retry
import pandas as pd
import openmeteo_requests



# Cell 2: Rename Duplicates
<div>
    <p>This Python script reads a CSV file containing weather data, identifies duplicate city names, and renames them with a numeric suffix to make them unique. It then writes the modified data to a new CSV file. Here's a breakdown of the script:</p>
    <ol>
        <li>It defines a function <code>rename_duplicates</code> that takes a CSV file path as input.</li>
        <li>Inside the function:
            <ul>
                <li>It initializes an empty dictionary <code>city_count</code> to store counts of each city.</li>
                <li>It initializes an empty list <code>new_rows</code> to store modified rows.</li>
                <li>It opens the input CSV file in read mode.</li>
                <li>It reads the CSV file using <code>csv.DictReader</code> to treat each row as a dictionary.</li>
                <li>It iterates over each row in the CSV file.
                    <ul>
                        <li>For each row:
                            <ul>
                                <li>It retrieves the city name from the 'City' column.</li>
                                <li>If the city name is already in <code>city_count</code>, it increments the count and renames the city with a two-digit numeric suffix (<code>01</code> for the first repeat, <code>02</code> for the next, and so on).</li>
                                <li>If the city name is not in <code>city_count</code>, it adds the city to <code>city_count</code> with a count of 0.</li>
                                <li>It appends the modified row to the <code>new_rows</code> list.</li>
                            </ul>
                        </li>
                    </ul>
                </li>
                <li>It closes the input CSV file automatically when the <code>with</code> block ends.</li>
                <li>It sets the output CSV file path to <code>/kaggle/working/weather_modified.csv</code>.</li>
                <li>It opens the output CSV file in write mode.</li>
                <li>It writes the modified data to the output CSV file using <code>csv.DictWriter</code>.</li>
                <li>It closes the output CSV file when its <code>with</code> block ends.</li>
            </ul>
        </li>
    </ol>
</div>


In [None]:
import os

# Path to the input directory
input_dir = '/kaggle/input/'

# List all files in the input directory
input_files = os.listdir(input_dir)

# Print the names of all files
for file_name in input_files:
    print(file_name)


In [None]:
import csv

def rename_duplicates(csv_file):
    
    # Initialize a dictionary to store city counts
    city_count = {}
    new_rows = []

    # Open the CSV file and read its contents
    with open(csv_file, 'r', newline='') as file:
        reader = csv.DictReader(file)
        # Keep the header so the output has the same columns even if there are no data rows
        fieldnames = reader.fieldnames

        # Iterate through each row in the CSV
        for row in reader:
            city = row['City']
            # Check if the city is already in the dictionary
            if city in city_count:
                # If yes, increment the count and rename the city
                city_count[city] += 1
                row['City'] = f"{city}{city_count[city]:02d}"
            else:
                # If no, add the city to the dictionary with count 0
                city_count[city] = 0

            # Append the modified row to the new_rows list
            new_rows.append(row)

    # Write the modified data to a new CSV file
    new_csv_file = '/kaggle/working/weather_modified.csv'
    with open(new_csv_file, 'w', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(new_rows)

# Call the function to rename duplicates in the weather.csv file
rename_duplicates("/kaggle/input/weather.csv")
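
The renaming rule can be tried without touching any files. The sketch below is a minimal illustration rather than the notebook's own function: `rename_duplicate_names` is a hypothetical helper that applies the same counting and two-digit suffix logic to a plain list of names.

```python
def rename_duplicate_names(names):
    """Return a copy of names where repeats get a two-digit suffix (01, 02, ...)."""
    seen = {}      # name -> number of repeats seen so far
    renamed = []
    for name in names:
        if name in seen:
            # Repeat: bump the counter and append it as a suffix
            seen[name] += 1
            renamed.append(f"{name}{seen[name]:02d}")
        else:
            # First occurrence keeps its original name
            seen[name] = 0
            renamed.append(name)
    return renamed


sample = ["Paris", "Lima", "Paris", "Paris"]
print(rename_duplicate_names(sample))
# prints ['Paris', 'Lima', 'Paris01', 'Paris02']
```

One caveat of this rule: a generated name such as `Paris01` is not checked against names that already appear in the file.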


<!-- Cell 3 Markdown -->
# Cell 3: Data Scraping Setup
- **Open-Meteo API Client**:
  - Set up caching and a retry mechanism for API requests.
- **Read Coordinates**:
  - Read city coordinates from the "weather_modified.csv" file written in Cell 2.
- **Dataset Folder**:
  - Define the folder where the downloaded datasets are stored.
- **Variables**:
  - Initialize counters to track statistics and a dictionary to store scraped cities.


In [None]:
# Cell 3: Set up variables for data scraping

# Setup the Open-Meteo API client with cache and retry on error
cache_session = requests_cache.CachedSession('.cache', expire_after=-1)
retry_session = retry(cache_session, retries=5, backoff_factor=0.2)
openmeteo = openmeteo_requests.Client(session=retry_session)

# Read city coordinates from the "weather_modified.csv" file written in Cell 2
weather_data_path = '/kaggle/working/weather_modified.csv'
df = pd.read_csv(weather_data_path)

# Folder to store downloaded datasets
dataset_folder = '/kaggle/working/Dataset'

# Variables to keep track of statistics
total_cities = len(df)
total_downloaded = 0
total_remaining = 0

# Dictionary to store successfully downloaded cities
scraped_cities = {}


<!-- Cell 4 Markdown -->
# Cell 4: Data Scraping
- **Iterate Through Cities**:
  - Iterate through each city in the dataset.
- **API Request**:
  - Make requests to Open-Meteo API for weather data.
- **Process Data**:
  - Process API responses and save data to CSV.
- **Statistics**:
  - Print statistics on total cities, downloaded, and remaining.


In [None]:
import os
import time



# Create the dataset folder if it doesn't exist
dataset_folder = '/kaggle/working/Dataset'
if not os.path.exists(dataset_folder):
    os.makedirs(dataset_folder)

# Number of times to run the loop
num_runs = 2

# Loop to run the code multiple times
for run in range(1, num_runs + 1):
    print(f"\nRun {run} of {num_runs}:")
    # Count the CSV files present at the start of this run, so files saved
    # by an earlier run are not reported as remaining
    existing_files = os.listdir(dataset_folder)
    TOTAL_EXISTING = len([file for file in existing_files if file.endswith('.csv')])
    # Initialize error counter
    error_count = 0
    # Reset total downloaded for each run
    total_downloaded = 0

    # Iterate through cities
    for i in range(min(10, len(df))):  # limit each run to the first 10 cities
        City = df.iloc[i]['City']
        Lat = df.iloc[i]['Lat']
        Lng = df.iloc[i]['Lng']

        # Check if data for the city already exists
        file_name = f'{City}.csv'
        file_path = os.path.join(dataset_folder, file_name)
        if os.path.exists(file_path):
            print(f"Data for {City} already exists. Skipping...")
            continue

        # Specify API request parameters
        url = "https://archive-api.open-meteo.com/v1/archive"
        params = {
            "latitude": Lat,
            "longitude": Lng,
            "start_date": "2010-01-01",
            "end_date": "2024-02-20",
            "hourly": ["temperature_2m", "relative_humidity_2m", "dew_point_2m", "apparent_temperature", "precipitation", "rain", "snowfall", "snow_depth", "pressure_msl", "surface_pressure", "cloud_cover", "cloud_cover_low", "cloud_cover_mid", "cloud_cover_high", "wind_speed_10m", "wind_speed_100m", "wind_direction_10m", "wind_direction_100m", "wind_gusts_10m"]
        }

        # Make API request
        try:
            responses = openmeteo.weather_api(url, params=params)
        except Exception as e:
            print(f"Error fetching data for {City}: {e}")
            error_count += 1  # Increment error count
            time.sleep(5)  # Delay for 5 seconds before continuing
            continue

        # Print city name once its data has been fetched
        print(f"Downloaded data for {City}. Processing...")

        # Process first location. Add a for-loop for multiple locations or weather models
        response = responses[0]

        # Process hourly data
        hourly = response.Hourly()

        # Process and save data to CSV
        hourly_data = {
            "date": pd.date_range(
                start=pd.to_datetime(hourly.Time(), unit="s", utc=True),
                end=pd.to_datetime(hourly.TimeEnd(), unit="s", utc=True),
                freq=pd.Timedelta(seconds=hourly.Interval()),
                inclusive="left"
            ),
            "temperature_2m": hourly.Variables(0).ValuesAsNumpy(),
            "relative_humidity_2m": hourly.Variables(1).ValuesAsNumpy(),
            "dew_point_2m": hourly.Variables(2).ValuesAsNumpy(),
            "apparent_temperature": hourly.Variables(3).ValuesAsNumpy(),
            "precipitation": hourly.Variables(4).ValuesAsNumpy(),
            "rain": hourly.Variables(5).ValuesAsNumpy(),
            "snowfall": hourly.Variables(6).ValuesAsNumpy(),
            "snow_depth": hourly.Variables(7).ValuesAsNumpy(),
            "pressure_msl": hourly.Variables(8).ValuesAsNumpy(),
            "surface_pressure": hourly.Variables(9).ValuesAsNumpy(),
            "cloud_cover": hourly.Variables(10).ValuesAsNumpy(),
            "cloud_cover_low": hourly.Variables(11).ValuesAsNumpy(),
            "cloud_cover_mid": hourly.Variables(12).ValuesAsNumpy(),
            "cloud_cover_high": hourly.Variables(13).ValuesAsNumpy(),
            "wind_speed_10m": hourly.Variables(14).ValuesAsNumpy(),
            "wind_speed_100m": hourly.Variables(15).ValuesAsNumpy(),
            "wind_direction_10m": hourly.Variables(16).ValuesAsNumpy(),
            "wind_direction_100m": hourly.Variables(17).ValuesAsNumpy(),
            "wind_gusts_10m": hourly.Variables(18).ValuesAsNumpy(),
        }

        # Create a DataFrame from the processed data
        hourly_dataframe = pd.DataFrame(data=hourly_data)

        # Save the DataFrame to CSV with the original city name
        hourly_dataframe.to_csv(file_path)

        total_downloaded += 1

    # Calculate the total remaining cities to be downloaded
    total_remaining = total_cities - total_downloaded - TOTAL_EXISTING

    # Print statistics for each run
    print(f"\nTotal Cities: {total_cities}")
    print(f"Total Downloaded for Run {run}: {total_downloaded}")
    print(f"Total Existing: {TOTAL_EXISTING}")
    print(f"Total Remaining: {total_remaining}")
    print(f"Total Errors: {error_count}")

print("All runs completed.")
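
The column mapping in the cell above relies on `hourly.Variables(i)` returning values in the same order as the `"hourly"` list in `params`. A loop over that list keeps names and indices in step. The sketch below is one possible way to do it, using only the accessor methods already called in the cell (`Time`, `TimeEnd`, `Interval`, `Variables`, `ValuesAsNumpy`). `hourly_to_columns`, `FakeVariable`, and `FakeHourly` are hypothetical names; the two fake classes are stand-ins so the example runs without a network request, and the sketch assumes the time accessors return integer seconds.

```python
from datetime import datetime, timezone


def hourly_to_columns(hourly, variable_names):
    """Build a dict of columns, pairing each name with its positional index."""
    # End is exclusive, matching inclusive="left" in the pandas date_range call
    timestamps = range(hourly.Time(), hourly.TimeEnd(), hourly.Interval())
    columns = {
        "date": [datetime.fromtimestamp(ts, tz=timezone.utc) for ts in timestamps]
    }
    for index, name in enumerate(variable_names):
        columns[name] = list(hourly.Variables(index).ValuesAsNumpy())
    return columns


class FakeVariable:
    """Stand-in for one hourly variable (hypothetical, for illustration only)."""
    def __init__(self, values):
        self._values = values

    def ValuesAsNumpy(self):
        return self._values


class FakeHourly:
    """Stand-in for the object returned by response.Hourly() (hypothetical)."""
    def __init__(self, start, end, interval, series):
        self._start, self._end, self._interval = start, end, interval
        self._series = series

    def Time(self):
        return self._start

    def TimeEnd(self):
        return self._end

    def Interval(self):
        return self._interval

    def Variables(self, index):
        return FakeVariable(self._series[index])


names = ["temperature_2m", "relative_humidity_2m"]
fake = FakeHourly(0, 3 * 3600, 3600, [[1.0, 2.0, 3.0], [50.0, 55.0, 60.0]])
columns = hourly_to_columns(fake, names)
print(list(columns), len(columns["date"]))
# prints ['date', 'temperature_2m', 'relative_humidity_2m'] 3
```

In the notebook, the resulting dictionary could be passed to `pd.DataFrame`, with the same list reused for both `params["hourly"]` and `variable_names`.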


# Cell 5: Display details of the CSV file
- **Dataframe Information**:
  - Column names, dtypes, and non-null counts from `df_sample.info()`.
- **Dataframe Shape**:
  - Number of rows and columns from `df_sample.shape`.
- **Dataframe Descriptive Statistics**:
  - Summary statistics from `df_sample.describe()`.


In [None]:
# Cell 5: Import necessary libraries
import pandas as pd
import os

# Specify the path to the CSV file you want to display
csv_file_path = '/kaggle/working/Dataset/Delhi.csv'

# Load the CSV file into a DataFrame
df_sample = pd.read_csv(csv_file_path)

# Display the DataFrame
df_sample.head()




In [None]:
# Displaying all details about the CSV file
# info() prints its summary directly and returns None, so it is called on its own
df_sample.info()
df_sample_shape = df_sample.shape
df_sample_describe = df_sample.describe()

df_sample_shape, df_sample_describe

<!-- Cell 6 Markdown -->
# Cell 6: Find Cities Not Scraped
- **Read Weather Data**:
  - Read the "weather_modified.csv" file to get city names.
- **Extract Unique Cities**:
  - Extract unique city names along with their coordinates.
- **Find Missing Cities**:
  - Compare cities in the weather dataset with those in the dataset folder to find missing cities.
- **Save Results**:
  - Save the list of cities not scraped to a CSV file.


In [None]:
# Cell 6: Find cities not scraped
# Read the weather_modified.csv file written in Cell 2
weather_csv_path = '/kaggle/working/weather_modified.csv'
df_weather = pd.read_csv(weather_csv_path)

# Extract unique city names and coordinates from weather_modified.csv
unique_cities_weather = df_weather[['City', 'Lat', 'Lng']].drop_duplicates()

# Directory containing the dataset files
dataset_folder = '/kaggle/working/Dataset'

# Get list of unique city names from filenames in the dataset folder
files_in_dataset = os.listdir(dataset_folder)
unique_cities_dataset = set(os.path.splitext(file)[0] for file in files_in_dataset)

# Find cities that are in weather_modified.csv but not in the dataset folder
cities_not_scraped = unique_cities_weather[~unique_cities_weather['City'].isin(unique_cities_dataset)]

# Reset index to have a proper index in the final DataFrame
# (reassigning avoids modifying a filtered slice in place)
cities_not_scraped = cities_not_scraped.reset_index(drop=True)

# Save the DataFrame to CSV
output_csv_path = '/kaggle/working/cities_not_scraped.csv'
cities_not_scraped.to_csv(output_csv_path, index=True)

print(f"CSV file containing cities not scraped has been saved to: {output_csv_path}")

# Display the DataFrame
cities_not_scraped
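
The comparison in this cell is a set difference between city names and file stems. The sketch below shows the same idea with the standard library only, so it can be run anywhere. `find_missing_cities` is a hypothetical helper, and unlike the cell above it ignores files that do not end in `.csv`; that filter is an added assumption, not part of the notebook.

```python
import os
import tempfile


def find_missing_cities(city_names, dataset_folder):
    """Return the city names that have no '<city>.csv' file in dataset_folder."""
    scraped = {
        os.path.splitext(name)[0]
        for name in os.listdir(dataset_folder)
        if name.endswith('.csv')
    }
    # Preserve the input order of the cities
    return [city for city in city_names if city not in scraped]


# Demonstrate on a temporary folder holding one CSV and one unrelated file
with tempfile.TemporaryDirectory() as folder:
    open(os.path.join(folder, 'Delhi.csv'), 'w').close()
    open(os.path.join(folder, 'notes.txt'), 'w').close()
    missing = find_missing_cities(['Delhi', 'Mumbai', 'notes'], folder)

print(missing)
# prints ['Mumbai', 'notes']
```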

In [None]:
# Cell 6: Details of the cities-not-scraped DataFrame

# Features of the DataFrame
df_features = cities_not_scraped.columns.tolist()

# Information about the DataFrame (info() prints directly and returns None)
cities_not_scraped.info()

# Shape of the DataFrame
df_shape = cities_not_scraped.shape

# Display the remaining details
df_features, df_shape



# Cell 7: Plots
- **Time Series Plot of Temperature**
- **Correlation Heatmap with adjusted font size and color palette**
- **Histogram of Relative Humidity**
- **Scatter Plot of Temperature vs. Dew Point**
- **Wind Rose Plot** (drawn as a histogram of wind direction; assumes wind direction data is available)
- **Box Plot of Cloud Cover**
- **Bar Plot of Precipitation Types**
- **Line Plot of Wind Speed**

In [None]:
# !pip install matplotlib
# !pip install seaborn
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
file_path = '/kaggle/working/Dataset/Delhi.csv'
data = pd.read_csv(file_path)

# Convert 'date' column to datetime format
data['date'] = pd.to_datetime(data['date'])

# Time Series Plot of Temperature
plt.figure(figsize=(12, 6))
plt.plot(data['date'], data['temperature_2m'], color='blue')
plt.title('Temperature Time Series')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.grid(True)
plt.show()

# Correlation Heatmap with adjusted font size and color palette
plt.figure(figsize=(10, 8))
# numeric_only=True (pandas 1.5+) skips the datetime 'date' column
sns.heatmap(data.corr(numeric_only=True), annot=True, fmt='.2f', cmap='viridis', annot_kws={'size': 10})
plt.title('Correlation Heatmap')
plt.show()

# Histogram of Relative Humidity
plt.figure(figsize=(8, 6))
plt.hist(data['relative_humidity_2m'], bins=30, color='green', edgecolor='black')
plt.title('Histogram of Relative Humidity')
plt.xlabel('Relative Humidity (%)')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
plt.show()

# Scatter Plot of Temperature vs. Dew Point
plt.figure(figsize=(8, 6))
plt.scatter(data['temperature_2m'], data['dew_point_2m'], color='orange', alpha=0.5)
plt.title('Temperature vs. Dew Point')
plt.xlabel('Temperature (°C)')
plt.ylabel('Dew Point (°C)')
plt.grid(True)
plt.show()

# Wind Rose Plot, drawn here as a histogram of wind direction (assumes wind direction data is available)
plt.figure(figsize=(10, 8))
sns.histplot(data['wind_direction_10m'], bins=36, stat='density', linewidth=0)
plt.title('Wind Rose Plot')
plt.xlabel('Wind Direction (degrees)')
plt.ylabel('Density')
plt.show()

# Box Plot of Cloud Cover
plt.figure(figsize=(8, 6))
sns.boxplot(x=data['cloud_cover'], color='skyblue')
plt.title('Box Plot of Cloud Cover')
plt.xlabel('Cloud Cover (%)')
plt.show()

# Bar Plot of Precipitation Types
precipitation_types = ['rain', 'snowfall']
precipitation_data = data[precipitation_types].sum()
plt.figure(figsize=(8, 6))
plt.bar(precipitation_types, precipitation_data, color=['blue', 'cyan'])
plt.title('Bar Plot of Precipitation Types')
plt.xlabel('Precipitation Type')
plt.ylabel('Total Amount')
plt.show()

# Line Plot of Wind Speed
plt.figure(figsize=(12, 6))
plt.plot(data['date'], data['wind_speed_10m'], color='green')
plt.title('Wind Speed Time Series')
plt.xlabel('Date')
plt.ylabel('Wind Speed (km/h)')
plt.grid(True)
plt.show()
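
The "Wind Rose Plot" in the cell above is a flat histogram of wind direction. A wind rose in the usual sense places the same counts on a polar axis. The sketch below is one way to do that, not the notebook's method: `bin_wind_directions` and `plot_wind_rose` are hypothetical helpers, the 36 sectors mirror the `bins=36` used above, and north is placed at the top with angles increasing clockwise.

```python
import math


def bin_wind_directions(directions, n_sectors=36):
    """Count observations per compass sector; sector 0 starts at 0 degrees."""
    width = 360.0 / n_sectors
    counts = [0] * n_sectors
    for direction in directions:
        if direction != direction:  # NaN is the only value unequal to itself
            continue
        # The trailing modulo guards against a float result of exactly 360.0
        sector = int((direction % 360.0) // width) % n_sectors
        counts[sector] += 1
    return counts


def plot_wind_rose(counts):
    """Draw the sector counts as bars on a polar axis."""
    # Imported here so the binning helper works without matplotlib installed
    import matplotlib.pyplot as plt

    n_sectors = len(counts)
    width = 2 * math.pi / n_sectors
    # Centre each bar on the middle of its sector
    angles = [(i + 0.5) * width for i in range(n_sectors)]
    ax = plt.subplot(projection="polar")
    ax.set_theta_zero_location("N")   # 0 degrees (north) at the top
    ax.set_theta_direction(-1)        # angles increase clockwise
    ax.bar(angles, counts, width=width, edgecolor="black")
    ax.set_title("Wind Rose (10 m wind direction)")
    plt.show()


counts = bin_wind_directions([0, 5, 10, 355, 360])
print(counts[0], counts[1], counts[35])
# prints 3 1 1
```

With the notebook's data loaded, `plot_wind_rose(bin_wind_directions(data['wind_direction_10m']))` would draw the rose for the selected city.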


<h2>Violin Plot of Temperature Distribution by Month</h2>

In [None]:
plt.figure(figsize=(12, 6))
sns.violinplot(x=data['date'].dt.month, y=data['temperature_2m'], palette='coolwarm')
plt.title('Temperature Distribution by Month')
plt.xlabel('Month')
plt.ylabel('Temperature (°C)')
plt.grid(axis='y', alpha=0.75)
plt.show()


<h2>Line Plot of Wind Speed and Wind Gusts</h2>

In [None]:
plt.figure(figsize=(12, 6))
plt.plot(data['date'], data['wind_speed_10m'], label='Wind Speed', color='green')
plt.plot(data['date'], data['wind_gusts_10m'], label='Wind Gusts', color='blue')
plt.title('Wind Speed and Wind Gusts Time Series')
plt.xlabel('Date')
plt.ylabel('Speed (km/h)')
plt.legend()
plt.grid(True)
plt.show()


<h2>Area Plot of Precipitation (Rain and Snowfall, Overlaid)</h2>

In [None]:
plt.figure(figsize=(12, 6))
plt.plot(data['date'], data['rain'], label='Rain', color='blue')
plt.plot(data['date'], data['snowfall'], label='Snowfall', color='cyan')
plt.fill_between(data['date'], data['rain'], alpha=0.3, color='blue')
plt.fill_between(data['date'], data['snowfall'], alpha=0.3, color='cyan')
plt.title('Total Precipitation (Rain and Snowfall)')
plt.xlabel('Date')
plt.ylabel('Amount')
plt.legend()
plt.ylim(0, 35)  # Set the y-axis limit to 0-35
plt.grid(True)
plt.show()


<h2>Box Plot of Apparent Temperature by Season</h2>

In [None]:
# Map each month to a season index: Dec-Feb -> 1, Mar-May -> 2, Jun-Aug -> 3, Sep-Nov -> 4
data['season'] = (data['date'].dt.month%12 + 3)//3
data['season'] = data['season'].map({1: 'Winter', 2: 'Spring', 3: 'Summer', 4: 'Fall'})
plt.figure(figsize=(10, 6))
sns.boxplot(x=data['season'], y=data['apparent_temperature'], palette='coolwarm')
plt.title('Apparent Temperature by Season')
plt.xlabel('Season')
plt.ylabel('Apparent Temperature (°C)')
plt.grid(axis='y', alpha=0.75)
plt.show()
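
The season formula `(month % 12 + 3) // 3` is compact but not obvious. Taking the month modulo 12 moves December to 0, so December, January, and February become 0, 1, and 2; adding 3 and integer-dividing by 3 then groups every three consecutive months under one index from 1 to 4. The sketch below applies the same arithmetic to plain integers. `month_to_season` is a hypothetical helper, and the labels assume Northern Hemisphere meteorological seasons, as in the cell above.

```python
SEASON_NAMES = {1: 'Winter', 2: 'Spring', 3: 'Summer', 4: 'Fall'}


def month_to_season(month):
    """Map a month number (1-12) to its season label."""
    season_index = (month % 12 + 3) // 3
    return SEASON_NAMES[season_index]


for month in (12, 1, 2, 3, 6, 9, 11):
    print(month, month_to_season(month))
# prints:
# 12 Winter
# 1 Winter
# 2 Winter
# 3 Spring
# 6 Summer
# 9 Fall
# 11 Fall
```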


In [None]:
# Replace infinite values with NaN (a float NaN keeps the columns numeric)
data.replace([float('inf'), float('-inf')], float('nan'), inplace=True)

# Scatter Plot Matrix
sns.pairplot(data[['temperature_2m', 'relative_humidity_2m', 'precipitation', 'wind_speed_10m']])
plt.show()

