Dataset(s) to be used:
1. https://www.kaggle.com/datasets/risakashiwabara/japannumber-of-visitors-to-japan/data
2. https://www.kaggle.com/datasets/sasakitetsuya/daily-mean-temperatures-data-of-4-cities-in-japan?resource=download

Analysis question: What is the relationship between tourism in Japan and temperature?

Columns that will (likely) be used: All columns

Hypothesis: Tourism in Japan would peak during the mild seasons of spring and autumn, corresponding to temperatures between 10°C and 25°C, while colder and hotter months would attract fewer visitors.

**Read data**: This section reads the two input datasets.
The first dataset contains daily average temperatures for four cities in Japan,
and the second dataset contains monthly foreign visitor counts by country.

In [2]:
import pandas as pd
import plotly.express as px

temp_path = "japan_city_daily_tavg_2005-2025.csv"
tour_path = "Number of foreign visitors to Japan by month_ .csv"

temp_df = pd.read_csv(temp_path)
tour_df = pd.read_csv(tour_path)



**Process temperature data (4 cities, 2017–2023)**: This section processes the temperature dataset.
It keeps only the years 2017–2023, computes each day's average temperature
across Sapporo, Tokyo, Osaka, and Fukuoka, aggregates to monthly averages
for each year, and then computes the average temperature for each month
across the seven years (2017–2023).

In [3]:
temp_df["date"] = pd.to_datetime(temp_df["date"])
temp_df["year"] = temp_df["date"].dt.year
temp_df["month"] = temp_df["date"].dt.month

temp_df_17_23 = temp_df[(temp_df["year"] >= 2017) & (temp_df["year"] <= 2023)].copy()

city_cols = ["Sapporo", "Tokyo", "Osaka", "Fukuoka"]
temp_df_17_23["city_mean"] = temp_df_17_23[city_cols].mean(axis=1, skipna=True)

monthly_year_temp = (
    temp_df_17_23
    .groupby(["year", "month"], as_index=False)["city_mean"]
    .mean()
)

temp_month_avg = (
    monthly_year_temp
    .groupby("month", as_index=False)["city_mean"]
    .mean()
    .rename(columns={"city_mean": "avg_temp"})
)
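
The two-stage averaging above (daily values to year-month means, then year-month means to calendar-month means) gives every year equal weight, even if one year has fewer readings for a month. A minimal sketch on synthetic data shows the difference from pooling all days at once; the frame `toy` and its values are invented purely for illustration:

```python
import pandas as pd

# Synthetic daily data (hypothetical values): January 2017 has two
# readings, January 2018 has only one.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-02", "2018-01-01"]),
    "city_mean": [0.0, 2.0, 10.0],
})
toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month

# Stage 1: one mean per (year, month).
per_year_month = (
    toy.groupby(["year", "month"], as_index=False)["city_mean"].mean()
)

# Stage 2: mean of the yearly means, so each year counts once.
two_stage = per_year_month.groupby("month")["city_mean"].mean().loc[1]

# Single-stage alternative: pools all days, so a year with more
# readings carries more weight.
pooled = toy.groupby("month")["city_mean"].mean().loc[1]

print(two_stage, pooled)
```

When every year has complete daily coverage the two approaches give nearly the same result; they diverge only when coverage is uneven.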

**Process tourism data (all countries, 2017–2023)**: This section processes the tourism dataset.
It cleans the month column, converts month abbreviations to month numbers,
keeps only years 2017–2023, sums visitor counts across all countries for
each year-month, and computes the average number of visitors per month
across the seven years (2017–2023).

In [4]:
tour_df = tour_df.rename(columns={"Month ": "Month"})

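tour_df["Month_clean"] = (
    tour_df["Month"]
    .astype(str)
    .str.strip()
    # .str.replace strips the period inside labels such as "Jan.";
    # Series.replace(..., regex=False) would only match whole cell values.
    .str.replace(".", "", regex=False)
)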

month_map = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4,
    "May": 5, "Jun": 6, "Jul": 7, "Aug": 8,
    "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}
tour_df["month"] = tour_df["Month_clean"].str[:3].map(month_map)

tour_df_17_23 = tour_df[(tour_df["Year"] >= 2017) & (tour_df["Year"] <= 2023)].copy()

tour_year_month = (
    tour_df_17_23
    .groupby(["Year", "month"], as_index=False)["Visitor"]
    .sum()
)

tour_month_avg = (
    tour_year_month
    .groupby("month", as_index=False)["Visitor"]
    .mean()
    .rename(columns={"Visitor": "avg_visitors"})
)
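
One detail worth checking in the month-cleaning step is that `Series.replace` with `regex=False` matches whole cell values, whereas `Series.str.replace` substitutes inside each string. The sketch below contrasts the two on hypothetical labels (the real file's spellings may differ) and lists any label that fails to map to a month number, since unmapped rows become `NaN` and are dropped by `groupby`:

```python
import pandas as pd

# Hypothetical month labels, invented for illustration.
labels = pd.Series(["Jan.", " Feb ", "Sept.", "."])
stripped = labels.str.strip()

# Whole-value replacement: only the cell that equals "." changes.
whole_value = stripped.replace(".", "", regex=False)

# Substring replacement: every period is removed.
substring = stripped.str.replace(".", "", regex=False)

# Abbreviated mapping, just enough for this example.
month_map = {"Jan": 1, "Feb": 2, "Sep": 9}
months = substring.str[:3].map(month_map)

# Labels that did not map to a month number.
unmapped = substring[months.isna()]

print(whole_value.tolist())
print(substring.tolist())
print(unmapped.index.tolist())
```

Running a check like `unmapped` on the real column before aggregating would confirm that no rows are lost to unexpected spellings.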

**Merge by month**: This section merges the temperature and tourism datasets
on the month column, resulting in a single table that contains
the average temperature and average number of visitors for each month.

In [5]:
merged = pd.merge(temp_month_avg, tour_month_avg, on="month").sort_values("month")

print(merged)

    month   avg_temp  avg_visitors
0       1   4.106336  1.672743e+06
1       2   4.924361  1.390581e+06
2       3   9.526651  1.380490e+06
3       4  13.759484  1.501275e+06
4       5  18.672197  1.400111e+06
5       6  22.035437  1.448114e+06
6       7  26.318433  1.243452e+06
7       8  27.410753  1.111437e+06
8       9  23.597857  9.929899e+05
9      10  17.489977  1.182924e+06
10     11  12.192103  1.183151e+06
11     12   6.168971  1.302899e+06
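
Because `pd.merge` performs an inner join by default, a month missing from either table would silently vanish from the result. The following is a sketch of a guarded merge on hypothetical three-month frames (`temp_toy` and `tour_toy` are invented names and values). The `validate` argument makes pandas raise a `MergeError` if `month` is duplicated on either side:

```python
import pandas as pd

# Hypothetical stand-ins for temp_month_avg and tour_month_avg.
temp_toy = pd.DataFrame({"month": [1, 2, 3], "avg_temp": [4.0, 5.0, 9.5]})
tour_toy = pd.DataFrame(
    {"month": [1, 2, 3], "avg_visitors": [1.6e6, 1.4e6, 1.4e6]}
)

merged_toy = pd.merge(
    temp_toy,
    tour_toy,
    on="month",
    how="inner",
    validate="one_to_one",  # each month may appear once per table
).sort_values("month")

# Months present in either input but absent after the merge.
expected_months = set(temp_toy["month"]) | set(tour_toy["month"])
missing = expected_months - set(merged_toy["month"])

print(len(merged_toy), missing)
```

For the real tables, the same check should report 12 rows and an empty `missing` set, which matches the printed output above.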


**Plot with plotly.express**: This section creates a line chart using plotly.express.
The x-axis shows average temperature and the y-axis shows
average foreign visitors. Each point corresponds to a calendar month (1–12).
Because the merged table is sorted by month, the line connects the points
in calendar order (January through December) rather than in order of increasing temperature.

In [6]:
fig = px.line(
    merged,
    x="avg_temp",
    y="avg_visitors",
    text="month",
    markers=True,
    labels={
        "avg_temp": "Average Monthly Temperature (°C)",
        "avg_visitors": "Average Monthly Foreign Visitors"
    },
    title="Line Chart: Temperature vs Foreign Visitors (2017–2023 Monthly Averages)"
)

fig.update_traces(textposition="top center")
fig.show()

The original hypothesis proposed that tourism in Japan would peak during the mild seasons of spring and autumn, corresponding to temperatures between 10°C and 25°C, while colder and hotter months would attract fewer visitors.
However, the actual data reveals a different pattern.

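Based on the monthly averages from 2017 to 2023, the highest visitor count occurs in January (about 1.67 million at 4.1°C), and the four months averaging below 10°C (December through March) have the highest mean visitor count of the three temperature bands. The two months above 25°C (July and August) have the lowest band average, and the months between 10°C and 25°C fall in between. The mild band is not uniform, however: April and June rank second and third overall, while September (about 0.99 million at 23.6°C) is the lowest month of the year.

In other words:

1. Most tourists come during the coldest months (temperature < 10°C)

2. The second-highest average visitor levels occur in the mild months (10°C–25°C), although this band contains both April (the second-highest month) and September (the lowest month)

3. The hottest months (July and August, > 25°C) have the lowest average visitor numbers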

This outcome contradicts the initial hypothesis and suggests that factors beyond temperature—such as holiday seasons, snow-related activities, or major winter events—may play a larger role in driving tourist inflow to Japan.
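
The temperature-band comparison in the summary can be reproduced directly from the printed merged table. The sketch below copies those twelve rows into a frame named `table` (a name chosen for this example). The band edges of 10°C and 25°C come from the hypothesis, and the Pearson correlation is an added summary statistic that the notebook itself does not compute:

```python
import pandas as pd

# Values copied from the printed merged table.
table = pd.DataFrame({
    "month": range(1, 13),
    "avg_temp": [
        4.106336, 4.924361, 9.526651, 13.759484, 18.672197, 22.035437,
        26.318433, 27.410753, 23.597857, 17.489977, 12.192103, 6.168971,
    ],
    "avg_visitors": [
        1.672743e6, 1.390581e6, 1.380490e6, 1.501275e6, 1.400111e6,
        1.448114e6, 1.243452e6, 1.111437e6, 9.929899e5, 1.182924e6,
        1.183151e6, 1.302899e6,
    ],
})

# Assign each month to a temperature band from the hypothesis.
table["band"] = pd.cut(
    table["avg_temp"],
    bins=[float("-inf"), 10, 25, float("inf")],
    labels=["<10", "10-25", ">25"],
)

# Mean visitors per band, and the linear correlation across months.
band_means = table.groupby("band", observed=True)["avg_visitors"].mean()
pearson_r = table["avg_temp"].corr(table["avg_visitors"])

print(band_means.round(0))
print(round(pearson_r, 2))
```

With only twelve averaged points, the correlation is a rough descriptive figure rather than evidence of a causal link.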