## **Bikes Data Scraping from `Bikez.com` using BeautifulSoup**

## **Overview**
- Uses Python with libraries like `requests`, `BeautifulSoup`, and `pandas`.
- Automatically detects all available years on Bikez.com.
- Iteratively scrapes data for each year and appends it to a central DataFrame.
- Caps the total number of rows collected across all years to keep server load and test runs manageable.
- Saves the resulting dataset to a CSV file named `bikes_data.csv`.

### **Import Libraries & Fetch All Available Years**

In [None]:
from bs4 import BeautifulSoup
import requests

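starting_url = 'https://bikez.com/years/index.php'
starting_request = requests.get(starting_url, timeout=30)
starting_request.raise_for_status()  # Stop early if the index page failed to load
starting_soup = BeautifulSoup(starting_request.text, 'html.parser')

# Year links sit in table cells with alternating 'even' / 'odd' classes
even_years = starting_soup.find_all('td', {'class': 'even'})
odd_years = starting_soup.find_all('td', {'class': 'odd'})
all_years = even_years + odd_years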
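
Concatenating `even_years + odd_years` lists every `even` cell before every `odd` one, so the years are no longer in page order. If page order matters, a single `find_all` call with both class names keeps it. The sketch below is a hedged variant, not the notebook's own method: `extract_year_cells` and the sample HTML are hypothetical, and it assumes each year cell wraps its link in an `<a>` tag, as the later `year.a` access implies.

```python
from bs4 import BeautifulSoup


def extract_year_cells(html):
    """Return the 'even'/'odd' <td> cells that contain a link, in page order."""
    soup = BeautifulSoup(html, 'html.parser')
    # A list of class names matches a cell carrying any one of them.
    cells = soup.find_all('td', class_=['even', 'odd'])
    # Skip cells without a link so that `cell.a[...]` cannot fail later.
    return [cell for cell in cells if cell.a is not None]


# Hypothetical markup standing in for the real index page.
sample_html = """
<table>
  <tr><td class="even"><a href="../year/a.php">2024</a></td></tr>
  <tr><td class="odd"><a href="../year/b.php">2023</a></td></tr>
  <tr><td class="even"><a href="../year/c.php">2022</a></td></tr>
  <tr><td class="odd">no link here</td></tr>
</table>
"""

year_cells = extract_year_cells(sample_html)
print([cell.a.text for cell in year_cells])
```

With the real page, `all_years = extract_year_cells(starting_request.text)` would be a drop-in replacement for the three `find_all` lines.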


### **(Commented) Define Scraping Outline & Helper Function**
 - After scraping, check whether the DataFrame is empty. If no data was collected, uncomment this section and re-run the entire notebook. The loop relies on a `scrape_year(year_url, year)` helper, which must be defined before it runs.

In [None]:
# import pandas as pd

# models = pd.DataFrame(columns=['Model', 'Year', 'URL'])

# for year in all_years:
#     year_url = 'https://bikez.com' + year.a['href'].split('..')[1]
#     year_text = year.a.text.strip()
#     year_number = ''.join(filter(str.isdigit, year_text))  # Extract numbers like "2024"

#     if not year_number:
#         continue  # Skip if no year found

#     year_models = scrape_year(year_url, int(year_number))
#     models = pd.concat([models, year_models], ignore_index=True)
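
The loop calls `scrape_year`, but no definition of it appears in these cells. The following is a minimal sketch of what such a helper could look like, not the author's implementation. It assumes that model links on a year page are `<a>` tags whose `href` contains `/motorcycles/` (suggested by the URLs in the output further down); `parse_year_page` is a hypothetical name, and the real page layout may need a different selector.

```python
from urllib.parse import urljoin

import pandas as pd
import requests
from bs4 import BeautifulSoup


def parse_year_page(html, year, page_url):
    """Turn one year page's HTML into a Model/Year/URL DataFrame."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    seen = set()
    for link in soup.find_all('a', href=True):
        # ASSUMPTION: model pages live under a /motorcycles/ path.
        if '/motorcycles/' not in link['href']:
            continue
        name = link.get_text(strip=True)
        url = urljoin(page_url, link['href'])  # Resolve '../' style hrefs
        if not name or url in seen:
            continue  # Skip image-only links and repeated URLs
        seen.add(url)
        rows.append({'Model': name, 'Year': year, 'URL': url})
    return pd.DataFrame(rows, columns=['Model', 'Year', 'URL'])


def scrape_year(year_url, year):
    """Download one year page and parse it (hypothetical helper)."""
    response = requests.get(year_url, timeout=30)
    response.raise_for_status()
    return parse_year_page(response.text, year, year_url)
```

Keeping the parsing separate from the download means `parse_year_page` can be checked against a saved HTML file without hitting the site.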


### **Initialize DataFrame & Loop Through Years to Build Dataset**

In [None]:
import pandas as pd

models = pd.DataFrame(columns=['Model', 'Year', 'URL'])
limit = 10000  # Maximum total rows to scrape (up to roughly 42k are available)
count = 0

for year in all_years:
    if count >= limit:
        break

    year_url = 'https://bikez.com' + year.a['href'].split('..')[1]
    year_text = year.a.text.strip()
    year_number = ''.join(filter(str.isdigit, year_text))

    if not year_number:
        continue

    year_models = scrape_year(year_url, int(year_number))

    # Keep only as many rows as still fit under the limit
    remaining = limit - count
    year_models = year_models.iloc[:remaining]

    models = pd.concat([models, year_models], ignore_index=True)
    count += len(year_models)
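
The loop builds each year URL by splitting the `href` on `'..'` and gets the year by joining every digit in the link text. Joining all digits would merge unrelated numbers if the text ever held more than the year (for example `"2024 (312)"` would become `2024312`). A hedged alternative, assuming the link text contains a four-digit year, uses `urljoin` and a regular expression; `parse_year_link` and the sample values are hypothetical.

```python
import re
from urllib.parse import urljoin

INDEX_URL = 'https://bikez.com/years/index.php'


def parse_year_link(href, text, index_url=INDEX_URL):
    """Return (absolute_url, year) for a year link, or None if no year is found."""
    # Match a standalone four-digit year starting with 19 or 20.
    match = re.search(r'\b(?:19|20)\d{2}\b', text)
    if match is None:
        return None
    # urljoin resolves '../' relative to the index page's own URL.
    return urljoin(index_url, href), int(match.group())


# Hypothetical href and link text, for illustration only.
print(parse_year_link('../year/sample.php', '2024 models (312)'))
print(parse_year_link('../year/sample.php', 'All years'))
```

Inside the loop this would replace the `split('..')` and `filter(str.isdigit, ...)` lines, with a `continue` when the function returns `None`.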


### **Preview the First Five Results**
- Cell 4 prints a quick sample of the scraped data.

In [None]:
print("Scraping complete! Showing 5 results:")
print(models.head())


Scraping complete! Showing 5 results:
                     Model  Year  \
0           Aeon AI-3 More  2024   
1    AJP PR7 650 Adventure  2024   
2         AJS Barletta 125  2024   
3  Apollino AM Thunder 125  2024   
4           Aprilia RS 457  2024   

                                                 URL  
0  https://bikez.com/motorcycles/aeon_ai-3_more_2...  
1  https://bikez.com/motorcycles/ajp_pr7_650_adve...  
2  https://bikez.com/motorcycles/ajs_barletta_125...  
3  https://bikez.com/motorcycles/apollino_am_thun...  
4  https://bikez.com/motorcycles/aprilia_rs_457_2...  


### **Display the Full DataFrame in Jupyter**
- Cell 5: evaluating `models` on its own renders the DataFrame as a table in the notebook (truncated in the middle for large DataFrames).

In [None]:
models

| | Model | Year | URL |
|---|---|---|---|
| 0 | Aeon AI-3 More | 2024 | https://bikez.com/motorcycles/aeon_ai-3_more_2... |
| 1 | AJP PR7 650 Adventure | 2024 | https://bikez.com/motorcycles/ajp_pr7_650_adve... |
| 2 | AJS Barletta 125 | 2024 | https://bikez.com/motorcycles/ajs_barletta_125... |
| 3 | Apollino AM Thunder 125 | 2024 | https://bikez.com/motorcycles/apollino_am_thun... |
| 4 | Aprilia RS 457 | 2024 | https://bikez.com/motorcycles/aprilia_rs_457_2... |
| ... | ... | ... | ... |
| 9995 | Kymco G-Dink 300i | 2012 | https://bikez.com/motorcycles/kymco_g-dink_300... |
| 9996 | Kymco K-XCT 125 | 2012 | https://bikez.com/motorcycles/kymco_k-xct_125_... |
| 9997 | Kymco Like 125 | 2012 | https://bikez.com/motorcycles/kymco_like_125_2... |
| 9998 | Kymco Like 200i LX | 2012 | https://bikez.com/motorcycles/kymco_like_200i_... |


### **Save the Collected Data to CSV**
- Cell 6 writes `models` to a CSV file named `bikes_data.csv`, without the index column.

In [None]:
models.to_csv('bikes_data.csv', index=False)
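
After a long scrape it can be worth confirming that the file on disk matches the DataFrame in memory. The sketch below is one hedged way to do that: it writes the CSV, reads it back with `pd.read_csv`, and compares row counts. `save_and_verify` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def save_and_verify(models, path='bikes_data.csv'):
    """Write `models` to CSV, reload it, and check that no rows were lost."""
    models.to_csv(path, index=False, encoding='utf-8')
    reloaded = pd.read_csv(path)
    if len(reloaded) != len(models):
        raise ValueError(
            f'Expected {len(models)} rows in {path}, found {len(reloaded)}'
        )
    return reloaded
```

Calling `save_and_verify(models)` in place of the plain `to_csv` line returns the reloaded DataFrame, which can then be inspected with `.head()`.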

### **Final Completion Message**
- Cell 7 prints a friendly `Alhumdulillah (^_^)` message once the run is done.

In [1]:
print("Alhumdulillah (^_^)")

Alhumdulillah (^_^)
