# Pacific Hurricane Season - 1975
---

> Konstantinos Mpouros <br>
> Github: https://github.com/konstantinosmpouros <br>
> Year: 2024 <br>

## 1. About the Notebook

This notebook scrapes and processes data from the Wikipedia page on the 1975 Pacific hurricane season. The goal is to extract details about each storm of that season and save them in CSV format. Web scraping gathers the raw text, and a language model (GPT-4o Mini) then cleans and structures it.

**Objective**

The objective of this project is to develop a Python script that performs the following tasks:

1. **Scraping the Wikipedia Page**: Use the Beautiful Soup and Requests libraries to scrape the relevant text from the Wikipedia page for the 1975 Pacific hurricane season: the name of each storm and the paragraphs describing it, which contain details such as start and end dates, number of deaths, and areas affected.

2. **Data Extraction and Processing**: Apply GPT-4o Mini to clean and process the extracted text. This involves parsing the scraped paragraphs to extract and organize pertinent details into a structured format.

3. **Data Structuring and Output**: Convert the cleaned data into a CSV file named `hurricanes_1975.csv`, containing the following columns:
   - `hurricane_storm_name`: The name of the hurricane or tropical storm.
   - `paragraph`: The paragraph given to GPT-4o Mini as input for extracting the data.
   - `date_start`: The start date of the hurricane.
   - `date_end`: The end date of the hurricane.
   - `number_of_deaths`: The number of fatalities attributed to the hurricane.
   - `list_of_areas_affected`: A list of regions or areas affected by the hurricane.
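
The column layout above can be sketched as a small record type. This is a minimal illustration using only the standard library; the `HurricaneRecord` class and the `to_csv_text` helper are hypothetical names, not part of the notebook, which builds the same columns with pandas. The example row reuses values shown in the notebook's own output, including the truncated paragraph preview.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class HurricaneRecord:
    """One row of hurricanes_1975.csv (hypothetical helper type)."""
    hurricane_storm_name: str
    paragraph: str
    date_start: str
    date_end: str
    number_of_deaths: str
    list_of_areas_affected: str


def to_csv_text(records):
    """Serialize records to CSV text, keeping the column order listed above."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(HurricaneRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


example = HurricaneRecord(
    hurricane_storm_name="Tropical Storm Eleanor",
    paragraph="An area of disturbed weather developed into a ...",
    date_start="July 10",
    date_end="July 12",
    number_of_deaths="0",
    list_of_areas_affected="Manzanillo",
)
print(to_csv_text([example]))
```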

## 2. Libraries

In [1]:
# Data Manipulation
import pandas as pd

# Regex
import re

# Web Scraping
from bs4 import BeautifulSoup
import requests

# OpenAI connection
from openai import OpenAI

## 3. Web Scraping

In [2]:
url = 'https://en.wikipedia.org/wiki/1975_Pacific_hurricane_season'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

In [3]:
# Initialize variables for the current storm name and paragraph, and a list for all rows
name, paragraph = '', ''
data = []

# The second "mw-heading mw-heading2" div on the page opens the "Systems" section
systems_heading = soup.find_all('div', class_='mw-heading mw-heading2')[1]

# Walk from sibling to sibling until the "Systems" section ends,
# i.e. until the next level-2 heading (or the end of the page) is reached
sibling = systems_heading.find_next_sibling()
while sibling is not None and sibling.get('class') != ['mw-heading', 'mw-heading2']:

    if sibling.name == 'div' and sibling.get('class') == ['mw-heading', 'mw-heading3']:
        # A level-3 heading starts a new storm, so store the previous one first
        if name:
            data.append([name, paragraph])
        name = re.sub(r'\[.*?\]', '', sibling.text)
        paragraph = ''
    elif sibling.name == 'p':
        # Accumulate the paragraph text, dropping newlines, non-breaking spaces and [..] markers
        paragraph += ' ' + sibling.text
        paragraph = re.sub(r'[\n\xa0]|\[.*?\]', '', paragraph).strip()

    # Move to the next sibling
    sibling = sibling.find_next_sibling()

# Store the last storm, which has no following level-3 heading to trigger the append
if name:
    data.append([name, paragraph])

data = pd.DataFrame(data, columns=['hurricane_storm_name', 'paragraph'])
data

| | hurricane_storm_name | paragraph |
|---|---|---|
| 0 | Hurricane Agatha | An area of disturbed weather about 290mi (467k... |
| 1 | Tropical Storm Bridget | On June 27, a tropical depression formed about... |
| 2 | Hurricane Carlotta | A disturbance 480mi (772km) south of Acapulco ... |
| 3 | Hurricane Denise | An unstable area developed a circulation and b... |
| 4 | Tropical Storm Eleanor | An area of disturbed weather developed into a ... |
| 5 | Tropical Storm Francene | A rapidly moving squally area of disturbed wea... |
| 6 | Tropical Storm Georgette | An area of disturbed weather about 800mi (1,28... |
| 7 | Tropical Storm Hilary | A tropical disturbance formed on August 11 and... |
| 8 | Hurricane Ilsa | On August 18, a tropical depression formed sou... |
| 9 | Hurricane Jewel | On August24, a tropical depression formed from... |
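
The previews above contain fused tokens such as `290mi` and `August24`. These most likely come from the cleanup regex, which deletes non-breaking spaces (`\xa0`) and newlines instead of replacing them with a space. A minimal sketch of an alternative follows; `clean_text` is a hypothetical helper, not a function from the notebook, and the sample sentence is made up for illustration.

```python
import re


def clean_text(text):
    """Drop citation markers like [12] and turn newlines/non-breaking spaces into plain spaces."""
    text = re.sub(r'\[.*?\]', '', text)          # remove bracketed citation markers
    text = re.sub(r'[\n\xa0]', ' ', text)        # replace with a space instead of deleting
    return re.sub(r' {2,}', ' ', text).strip()   # collapse doubled spaces, trim the ends


# Made-up sample that mimics the non-breaking spaces Wikipedia places inside dates and units
raw = 'On August\xa024, a tropical depression formed about 290\xa0mi (467\xa0km) away.[3]\n'
print(clean_text(raw))
# On August 24, a tropical depression formed about 290 mi (467 km) away.
```

Keeping the spaces also gives the language model cleaner input, since dates and distances arrive as separate tokens.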


In [4]:
data['date_start'], data['date_end'], data['number_of_deaths'], data['list_of_areas_affected'] = '', '', '', ''
data.head()

| | hurricane_storm_name | paragraph | date_start | date_end | number_of_deaths | list_of_areas_affected |
|---|---|---|---|---|---|---|
| 0 | Hurricane Agatha | An area of disturbed weather about 290mi (467k... | | | | |
| 1 | Tropical Storm Bridget | On June 27, a tropical depression formed about... | | | | |
| 2 | Hurricane Carlotta | A disturbance 480mi (772km) south of Acapulco ... | | | | |
| 3 | Hurricane Denise | An unstable area developed a circulation and b... | | | | |
| 4 | Tropical Storm Eleanor | An area of disturbed weather developed into a ... | | | | |


## 4. ChatGPT

In [5]:
gpt_keys = pd.read_csv('../ChatGPT API Keys.txt').columns
api_key = gpt_keys[0]
org_key = gpt_keys[1]
prj_key = gpt_keys[2]
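
Reading the keys through `pd.read_csv(...).columns` works when the file holds a single comma-separated line, but it keeps the secrets in a plain file next to the project. A common alternative is to read them from environment variables. The sketch below assumes the variable names `OPENAI_API_KEY`, `OPENAI_ORG_ID` and `OPENAI_PROJECT_ID`; `load_openai_credentials` is a hypothetical helper, not part of the notebook.

```python
import os


def load_openai_credentials(env=None):
    """Return api_key, organization and project read from an environment-style mapping."""
    env = os.environ if env is None else env
    api_key = env.get('OPENAI_API_KEY')
    if not api_key:
        raise RuntimeError('OPENAI_API_KEY is not set')
    return {
        'api_key': api_key,
        'organization': env.get('OPENAI_ORG_ID'),   # optional, None when unset
        'project': env.get('OPENAI_PROJECT_ID'),    # optional, None when unset
    }


# Example with an explicit mapping, so no real secret is needed to try it
creds = load_openai_credentials({'OPENAI_API_KEY': 'sk-example'})
print(creds['organization'])  # None
```

The resulting dictionary could then be unpacked into the client constructor, for example `OpenAI(**creds)`.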

In [6]:
client = OpenAI(
    api_key=api_key,
    organization=org_key,
    project=prj_key
)

In [7]:
for model in client.models.list():
    print(model)

Model(id='gpt-4-1106-preview', created=1698957206, object='model', owned_by='system')
Model(id='tts-1-1106', created=1699053241, object='model', owned_by='system')
Model(id='dall-e-2', created=1698798177, object='model', owned_by='system')
Model(id='tts-1', created=1681940951, object='model', owned_by='openai-internal')
Model(id='tts-1-hd-1106', created=1699053533, object='model', owned_by='system')
Model(id='tts-1-hd', created=1699046015, object='model', owned_by='system')
Model(id='dall-e-3', created=1698785189, object='model', owned_by='system')
Model(id='whisper-1', created=1677532384, object='model', owned_by='openai-internal')
Model(id='text-embedding-3-large', created=1705953180, object='model', owned_by='system')
Model(id='text-embedding-3-small', created=1705948997, object='model', owned_by='system')
Model(id='text-embedding-ada-002', created=1671217299, object='model', owned_by='openai-internal')
Model(id='gpt-4-turbo', created=1712361441, object='model', owned_by='system')
...

In [8]:
def create_prompt(paragraph):
    prompt = f"""
    Extract the following information from the given paragraph:
    1. Number of deaths
    2. Start date
    3. End date
    4. Affected areas

    Paragraph:
    {paragraph}

    Please provide the information in the following format, if no information provided reply None or 0:
    - Number of deaths: <number_of_deaths>
    - Start date: <start_date>
    - End date: <end_date>
    - Affected areas: <affected_areas>
    """
    return prompt

In [9]:
for i in range(len(data)):
    # Create the prompt
    prompt = create_prompt(data.iloc[i, 1])

    # Request to the GPT-4o-mini model
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": prompt}
        ],
        max_tokens=300,
        temperature=0.5 
    )

    response_text = response.choices[0].message.content
    
    # Extract information using regular expressions
    number_of_deaths = re.search(r'- Number of deaths: (.*?)\n', response_text).group(1)
    start_date = re.search(r'- Start date: (.*?)\n', response_text).group(1)
    end_date = re.search(r'- End date: (.*?)\n', response_text).group(1)
    affected_areas = re.search(r'- Affected areas: (.*)', response_text).group(1)

    
    data.iloc[i, 2], data.iloc[i, 3], data.iloc[i, 4], data.iloc[i, 5] = start_date, end_date, number_of_deaths, affected_areas
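
Each `re.search(...).group(1)` call above raises `AttributeError` if the model's reply omits a line or deviates from the requested format. A more forgiving parser could return `None` for any field it cannot find. This is a minimal sketch under that assumption; `FIELD_PATTERNS` and `parse_response` are hypothetical names, and the sample reply is hand-written in the format the prompt asks for.

```python
import re

# One pattern per output column; (.*) stops at the end of the line
FIELD_PATTERNS = {
    'number_of_deaths': r'Number of deaths:\s*(.*)',
    'date_start': r'Start date:\s*(.*)',
    'date_end': r'End date:\s*(.*)',
    'list_of_areas_affected': r'Affected areas:\s*(.*)',
}


def parse_response(response_text):
    """Return a dict with the four fields; a field missing from the reply becomes None."""
    parsed = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, response_text)
        parsed[field] = match.group(1).strip() if match else None
    return parsed


sample = (
    "- Number of deaths: 0\n"
    "- Start date: July 10\n"
    "- End date: July 12\n"
    "- Affected areas: Manzanillo"
)
print(parse_response(sample))
```

Because the keys match the DataFrame column names, the result could be assigned by label (for example with `data.loc[i, field] = value`) rather than by column position.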

In [10]:
data

| | hurricane_storm_name | paragraph | date_start | date_end | number_of_deaths | list_of_areas_affected |
|---|---|---|---|---|---|---|
| 0 | Hurricane Agatha | An area of disturbed weather about 290mi (467k... | June 1 | June 7 | 0 | Southwest of Acapulco, near Zihuatanejo, south... |
| 1 | Tropical Storm Bridget | On June 27, a tropical depression formed about... | June 27 | July 3 | 0 | |
| 2 | Hurricane Carlotta | A disturbance 480mi (772km) south of Acapulco ... | July 2 | July 11 | 0 | |
| 3 | Hurricane Denise | An unstable area developed a circulation and b... | July 4 | July 14 | 0 | parts of Mexico |
| 4 | Tropical Storm Eleanor | An area of disturbed weather developed into a ... | July 10 | July 12 | 0 | Manzanillo |
| 5 | Tropical Storm Francene | A rapidly moving squally area of disturbed wea... | July 27 | July 30 | 0 | |
| 6 | Tropical Storm Georgette | An area of disturbed weather about 800mi (1,28... | August 11 | August 14 | 0 | |
| 7 | Tropical Storm Hilary | A tropical disturbance formed on August 11 and... | August 11 | August 17 | 0 | |
| 8 | Hurricane Ilsa | On August 18, a tropical depression formed sou... | August 18 | September 5 | 0 | |
| 9 | Hurricane Jewel | On August24, a tropical depression formed from... | August 24 | September 3 | 0 | |
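
The extracted dates are free-text strings such as `June 1`, without a year. If sortable dates are wanted, they could be normalized to ISO format. The sketch below assumes every date falls in 1975 and that month names are in English (`%B` depends on the active locale); `to_iso_date` is a hypothetical helper, not part of the notebook.

```python
from datetime import datetime


def to_iso_date(text, year=1975):
    """Convert 'June 1' style text to 'YYYY-MM-DD'; return None if it does not parse."""
    try:
        return datetime.strptime(f'{text.strip()} {year}', '%B %d %Y').date().isoformat()
    except (ValueError, AttributeError):
        # ValueError: text is not a month/day pair; AttributeError: text is not a string
        return None


print(to_iso_date('June 1'))       # 1975-06-01
print(to_iso_date('September 5'))  # 1975-09-05
print(to_iso_date('None'))         # None
```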


In [12]:
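# Write without the DataFrame index so the file holds only the six columns listed in the objective
data.to_csv('hurricanes_1975.csv', index=False)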