## Getting house plant data from Wikipedia

The purpose of this code is to gather structured information about house plants from Wikipedia. It is divided into two main stages:

1. Collecting Plant Names from the Category Page

* The code sends a request to the Wikipedia category “House plants” page.

* It uses BeautifulSoup to parse the HTML and extract the titles of all pages (plant names) listed under that category.


The result is the foundation for the next stage.

2. Retrieving Detailed Plant Information

* For each plant name, the wikipedia Python library is used to fetch the plant’s Wikipedia page.

* The code retrieves a short summary of the plant and searches the full page content for sections on Cultivation and Toxicity (if available) using regular expressions.


The final result is a structured dataset containing the name, summary, cultivation details, and toxicity information for each house plant listed on Wikipedia.
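The section lookup in stage 2 can be illustrated on a small string. This is a minimal sketch, assuming the page text uses `== Heading ==` markers as the regular expressions in this notebook expect; `sample_content` and `extract_section` are hypothetical names used only for illustration.

```python
import re

# Hypothetical page text with "== Heading ==" section markers.
sample_content = (
    "Example plant is a species of flowering plant.\n"
    "\n"
    "== Description ==\n"
    "It has green leaves.\n"
    "\n"
    "== Cultivation ==\n"
    "It needs bright, indirect light.\n"
    "\n"
    "== Toxicity ==\n"
    "All parts are toxic to cats.\n"
)


def extract_section(content, heading_pattern):
    """Return the text under the first heading matching heading_pattern, or None."""
    # Capture lazily until the next heading ("\n==") or the end of the text.
    pattern = rf"==\s*{heading_pattern}\s*==([\s\S]*?)(?=\n==|\Z)"
    match = re.search(pattern, content)
    return match.group(1).strip() if match else None


cultivation = extract_section(sample_content, r"Cultivation(?:\s+and\s+uses)?")
toxicity = extract_section(sample_content, r"Toxicity")
missing = extract_section(sample_content, r"Ecology")  # no such section here

print(cultivation)
print(toxicity)
print(missing)
```

A section that does not exist yields `None`, which is why many plants end up with empty cultivation or toxicity fields.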

In [1]:
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import re
import requests
import time
from tqdm import tqdm
import wikipedia

## Getting data from Wikipedia

In [3]:
# Step 1: Get HTML content
url = "https://en.wikipedia.org/wiki/Category:House_plants"
resp = requests.get(url)
resp.raise_for_status()  # ensure request worked

# Step 2: Parse with BeautifulSoup
soup = BeautifulSoup(resp.text, "html.parser")

# Step 3: Extract category members
# Wikipedia category pages list entries in <div id="mw-pages"> inside <a> tags
plant_names = []
pages_div = soup.find("div", id="mw-pages")
if pages_div:
    for link in pages_div.find_all("a"):
        title = link.get_text()
        # skip navigation links like "next page"
        if not title.lower().startswith("next") and not title.lower().startswith("previous"):
            plant_names.append(title)

# Skip the first 2 entries, which are not plant names
print(plant_names[2:])

['Adelonema wallisii', 'Adenium obesum', 'Adiantum aethiopicum', 'Adiantum capillus-veneris', 'Adiantum formosum', 'Adiantum peruvianum', 'Aechmea chantinii', 'Aechmea fasciata', 'Aglaonema', 'Aglaonema commutatum', 'Aglaonema modestum', 'Aichryson × aizoides', 'Alocasia × mortfontanensis', 'Alocasia azlanii', 'Alocasia baginda', 'Alocasia brancifolia', 'Alocasia heterophylla', 'Alocasia infernalis', 'Alocasia lauterbachiana', 'Alocasia longiloba', 'Alocasia melo', 'Alocasia micholitziana', 'Alocasia nebula', 'Alocasia nycteris', 'Alocasia portei', 'Alocasia princeps', 'Alocasia reginae', 'Alocasia reginula', 'Alocasia reversa', 'Alocasia sanderiana', 'Alocasia sarawakensis', 'Alocasia scalprum', 'Alocasia sinuata', 'Alocasia wentii', 'Alocasia zebrina', 'Angraecum sesquipedale', 'Anthurium amnicola', 'Anthurium bakeri', 'Anthurium clarinervium', 'Anthurium crenatum', 'Anthurium gracile', 'Anthurium radicans', 'Anthurium scherzerianum', 'Anthurium vittariifolium', 'Anthurium warocquean
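A Wikipedia category page shows at most 200 members per page, so scraping a single HTML page likely misses part of a larger category (the run in this notebook processes 199 names). One possible alternative is the MediaWiki API's `list=categorymembers` query, which returns a continuation token while more results remain. The sketch below is not what this notebook does; `list_category_members`, `fetch_json`, and the fake responses are hypothetical names and data used to show the continuation loop without network access.

```python
def list_category_members(fetch_json, category="Category:House_plants"):
    """Collect page titles in a category, following API continuation tokens.

    fetch_json is a hypothetical callable that takes a dict of query
    parameters and returns the decoded JSON response as a dict.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "page",  # skip subcategories and files
        "cmlimit": "500",
        "format": "json",
    }
    titles = []
    while True:
        data = fetch_json(params)
        titles.extend(m["title"] for m in data["query"]["categorymembers"])
        cont = data.get("continue")
        if not cont:
            return titles
        # Merge the continuation values into the next request.
        params = {**params, **cont}


# Fake two-page response, used only to demonstrate the loop.
fake_pages = {
    None: {
        "query": {"categorymembers": [{"title": "Adenium obesum"}]},
        "continue": {"cmcontinue": "page|2", "continue": "-||"},
    },
    "page|2": {"query": {"categorymembers": [{"title": "Zamioculcas"}]}},
}


def fake_fetch(params):
    return fake_pages[params.get("cmcontinue")]


titles = list_category_members(fake_fetch)
print(titles)
```

In the notebook, `fetch_json` could wrap a `requests.get` call to `https://en.wikipedia.org/w/api.php` with `params=params` and return `resp.json()`.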

In [4]:
def get_plant_data(plant_name):
    """
    Fetch detailed plant information from Wikipedia.

    Args:
        plant_name (str): The name of the plant to search for.

    Returns:
        dict: A dictionary with the plant's name, summary, cultivation, and toxicity info.
              If the page cannot be retrieved, all fields except the name are None.
    """
    try:
        try:
            summary_text = wikipedia.summary(plant_name)
            page = wikipedia.page(plant_name)
        except wikipedia.exceptions.DisambiguationError:
            # Retry with a more specific search term
            refined_name = f"{plant_name} plant"
            summary_text = wikipedia.summary(refined_name)
            page = wikipedia.page(refined_name)
            plant_name = refined_name

    except wikipedia.exceptions.PageError:
        # Page does not exist
        print(f"Error: '{plant_name}' page not found.")
        return {
            'name': plant_name,
            'summary': None,
            'cultivation': None,
            'toxicity': None
        }
    except Exception as e:
        # Catch-all for other unexpected issues
        print(f"Unexpected error fetching '{plant_name}': {e}")
        return {
            'name': plant_name,
            'summary': None,
            'cultivation': None,
            'toxicity': None
        }

    # Extract sections only if page content is available
    cultivation_text = None
    toxicity_text = None

    if page and hasattr(page, 'content'):
        cultivation_match = re.search(r"==\s*Cultivation(?:\s+and\s+uses)?\s*==([\s\S]*?)(?=\n==|\Z)", page.content, flags=re.S)
        if cultivation_match:
            cultivation_text = cultivation_match.group(1).strip()

        toxicity_match = re.search(r"==\s*Toxicity\s*==([\s\S]*?)(?=\n==|\Z)", page.content, flags=re.S)
        if toxicity_match:
            toxicity_text = toxicity_match.group(1).strip()

    return {
        'name': plant_name,
        'summary': summary_text,
        'cultivation': cultivation_text,
        'toxicity': toxicity_text
    }
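One property of the patterns in `get_plant_data` is worth noting: the lookahead `\n==` also matches the start of a subsection heading such as `=== Propagation ===`, so the captured text stops at the first subsection. If subsection text should be kept, one option is to split the content on level-2 headings only. The sketch below assumes the same `== Heading ==` text format; `split_top_level_sections` and the sample text are hypothetical.

```python
import re


def split_top_level_sections(content):
    """Map each level-2 heading ("== Title ==") to its text, subsections included."""
    # (?m) lets ^ and $ match at line boundaries; (?!=) rejects "===" headings.
    heading = re.compile(r"(?m)^==(?!=)\s*(.+?)\s*==\s*$")
    matches = list(heading.finditer(content))
    sections = {}
    for i, m in enumerate(matches):
        # A section runs until the next level-2 heading or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(content)
        sections[m.group(1)] = content[m.end():end].strip()
    return sections


# Hypothetical page text with one subsection under Cultivation.
sample = (
    "Intro text.\n\n"
    "== Cultivation ==\n"
    "Needs bright light.\n\n"
    "=== Propagation ===\n"
    "Grown from cuttings.\n\n"
    "== Toxicity ==\n"
    "Toxic to cats.\n"
)

sections = split_top_level_sections(sample)
print(sections["Cultivation"])
print(sections["Toxicity"])
```

With this approach, `sections.get("Cultivation")` returns `None` for pages without that section, matching the behavior of the original function.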

In [27]:
all_plants_details = []

# Skip first 2 entries that aren't plant names
for plant in tqdm(plant_names[2:], desc="Fetching plant details"):
    retries = 3
    for attempt in range(retries):
        try:
            plant_details = get_plant_data(plant)
            all_plants_details.append(plant_details)
            break  # success, stop retrying
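        # requests raises its own ConnectionError, which is not a subclass of the built-in one.
        # get_plant_data also catches generic exceptions itself, so this branch runs only for errors that escape it.
        except (ConnectionError, requests.exceptions.ConnectionError):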
            if attempt < retries - 1:
                print(f"Connection error for {plant}, retrying ({attempt+1}/{retries})...")
                time.sleep(5)
            else:
                all_plants_details.append({"name": plant, "error": "Connection failed"})

Fetching plant details:  32%|███▏      | 63/199 [00:19<00:35,  3.88it/s]

Error: 'Bird's-nest fern' page not found.




Fetching plant details:  87%|████████▋ | 174/199 [01:32<00:06,  3.92it/s]

Error: 'Peperomia' page not found.


Fetching plant details: 100%|██████████| 199/199 [01:40<00:00,  1.99it/s]
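The retry loop above waits a fixed 5 seconds between attempts. A reusable helper with exponential backoff is one possible refinement. This is a minimal sketch, not part of the original notebook: `call_with_retries`, its parameters, and the `flaky` function are hypothetical. Since network failures raised through `requests` use `requests.exceptions.ConnectionError`, that class would need to be passed via `retry_on` in real use.

```python
import time


def call_with_retries(func, *args, retries=3, base_delay=1.0,
                      retry_on=(ConnectionError,), sleep=time.sleep):
    """Call func(*args); on a retry_on exception, wait and try again."""
    for attempt in range(1, retries + 1):
        try:
            return func(*args)
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: let the caller decide what to record
            # Delay doubles after each failed attempt: 1s, 2s, 4s, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated function that fails twice before succeeding.
calls = {"count": 0}


def flaky(name):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return {"name": name}


delays = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky, "Aglaonema", sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper easy to test, because the delays can be recorded rather than waited out.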


In [28]:
all_plants_details[:5]

[{'name': 'Adelonema wallisii',
  'summary': 'Adelonema wallisii (synonym Homalomena wallisii) is a species of aroid plant (family Araceae) native to Venezuela, Colombia, and Panama.\n\n',
  'cultivation': None,
  'toxicity': None},
 {'name': 'Adenium obesum',
  'summary': 'Adenium obesum, more commonly known as a desert rose, is a poisonous species of flowering plant belonging to the tribe Nerieae of the subfamily Apocynoideae of the dogbane family, Apocynaceae. It is native to the Sahel regions south of the Sahara (from Mauritania and Senegal to Sudan), tropical and subtropical eastern and southern Africa, as well as the Arabian Peninsula. Other names for the flower include Sabi star, kudu, mock azalea, and impala lily. Adenium obesum is a popular houseplant and bonsai in temperate regions.\n\n',
  'cultivation': "Adenium obesum is a popular houseplant and bonsai in temperate regions. It requires a sunny location and a minimum indoor temperature in winter of 10 °C (50 °F). It thrives

In [29]:
# Convert the list of dicts to a DataFrame
df_all_plants_details = pd.DataFrame(all_plants_details)

In [30]:
# An 'error' column appears only if the fetch loop recorded a connection failure
df_all_plants_details.head()

| | name | summary | cultivation | toxicity |
|---|---|---|---|---|
| 0 | Adelonema wallisii | Adelonema wallisii (synonym Homalomena wallisi... | | |
| 1 | Adenium obesum | Adenium obesum, more commonly known as a deser... | Adenium obesum is a popular houseplant and bon... | |
| 2 | Adiantum aethiopicum | Adiantum aethiopicum, also known as the common... | Adiantum aethiopicum is a popular and well kno... | |
| 3 | Adiantum capillus-veneris | Adiantum capillus-veneris, the Southern maiden... | Adiantum capillus-veneris is cultivated and w... | |
| 4 | Adiantum formosum | Adiantum formosum, known as the giant maidenha... | | |


### Cleaning

In [31]:
print("DataFrame shape", df_all_plants_details.shape)
print("Plants without any data")
# Rows where summary, cultivation, and toxicity are all missing
empty_rows = df_all_plants_details[['summary', 'cultivation', 'toxicity']].isna().all(axis=1)
print(df_all_plants_details[empty_rows])
print("---------------------")
# remove empty rows
df_all_plants_details = df_all_plants_details[~empty_rows]
print("Shape after removing empty rows", df_all_plants_details.shape)

DataFrame shape (199, 4)
Plants without any data
                 name summary cultivation toxicity
62   Bird's-nest fern    None        None     None
173         Peperomia    None        None     None
---------------------
Shape after removing empty rows (197, 4)


In [32]:
# Adding an ID column
df_all_plants_details = df_all_plants_details.reset_index(drop=True)
df_all_plants_details['id'] = df_all_plants_details.index  # IDs start from 0
# Replace missing values (None/NaN) with a placeholder string
df_all_plants_details = df_all_plants_details.fillna('No data available')

print(df_all_plants_details)

                           name  \
0            Adelonema wallisii   
1                Adenium obesum   
2          Adiantum aethiopicum   
3     Adiantum capillus-veneris   
4             Adiantum formosum   
..                          ...   
192         Pilea peperomioides   
193      Platycerium bifurcatum   
194        Platycerium superbum   
195  Plectranthus verticillatus   
196           Portulacaria afra   

                                               summary  \
0    Adelonema wallisii (synonym Homalomena wallisi...   
1    Adenium obesum, more commonly known as a deser...   
2    Adiantum aethiopicum, also known as the common...   
3    Adiantum capillus-veneris, the Southern maiden...   
4    Adiantum formosum, known as the giant maidenha...   
..                                                 ...   
192  Pilea peperomioides (), the Chinese money plan...   
193  Platycerium bifurcatum, commonly known as the ...   
194  Platycerium is a genus of about 18 fern specie...   

In [33]:
df_all_plants_details.head()

| | name | summary | cultivation | toxicity | id |
|---|---|---|---|---|---|
| 0 | Adelonema wallisii | Adelonema wallisii (synonym Homalomena wallisi... | No data available | No data available | 0 |
| 1 | Adenium obesum | Adenium obesum, more commonly known as a deser... | Adenium obesum is a popular houseplant and bon... | No data available | 1 |
| 2 | Adiantum aethiopicum | Adiantum aethiopicum, also known as the common... | Adiantum aethiopicum is a popular and well kno... | No data available | 2 |
| 3 | Adiantum capillus-veneris | Adiantum capillus-veneris, the Southern maiden... | Adiantum capillus-veneris is cultivated and w... | No data available | 3 |
| 4 | Adiantum formosum | Adiantum formosum, known as the giant maidenha... | No data available | No data available | 4 |


## Save file

In [34]:
# Reorder the columns and save the result to CSV
df_all_plants_details[['id', 'name', 'summary', 'cultivation', 'toxicity']].to_csv("../data/plants_data.csv", index=False)