### Extraction of Posts and Comments from Reddit

In this project, we focus on extracting posts and comments from Reddit using web scraping techniques and interaction with the API. Reddit is a platform rich in user-generated content on various topics, making it a valuable resource for applications such as sentiment analysis, trend tracking, and topic modeling.

#### Objectives
- **Data Collection**: Retrieve posts and comments from specific subreddits based on thematic criteria (e.g., r/mentalhealth, r/fitness).
- **Data Processing**: Clean and preprocess the extracted data to prepare it for analysis.
- **Data Storage**: Store the collected data in a structured format, such as a CSV file or database, for later analysis.

#### Tools and Technologies
- **Python**: The primary programming language for web scraping and interacting with the API.
- **Requests**: A library for making HTTP requests, in case additional scraping is required.
- **Pandas**: A data manipulation library for handling and analyzing the extracted data.

#### Getting Started
1. **Set Up the Environment**: Install necessary libraries using pip (`praw`, `requests`, `pandas`).
2. **Obtain API Credentials**: Create a Reddit account and register an application to get API credentials (client ID, secret, and user agent).
3. **Define the Extraction Logic**: Write functions to extract data from specific subreddits or threads based on keywords or categories.
4. **Run the Scraper**: Launch the script and monitor the data collection process.
5. **Analyze the Data**: Use Pandas to analyze the collected posts and comments for insights.

#### Conclusion
This project provides a hands-on introduction to using Reddit's API and analyzing data with Python, while also allowing manipulation of data from a dynamic online community.


<p style="color:#FBCE60;text-align:center;font-size:30px"> Scraping Reddit's  Posts And Articles </p>

In [None]:
# Installing BeautifulSoup4, a Python library used for web scraping purposes 
# to extract data from HTML and XML files.
!pip install bs4  

# Installing Selenium, a tool used for automating web browsers, often used for tasks 
# like automated testing, web scraping, and web interactions.
!pip install selenium





[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip





[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip





### Scraping Reddit's Health related Topics

In [2]:
# Importation des bibliothèques nécessaires pour le scraping web, 
# l'automatisation de navigateur et le traitement de données.
import urllib.request  # Pour gérer les requêtes URL
from selenium import webdriver  # Pour l'automatisation du navigateur
from selenium.webdriver.common.by import By  # Pour spécifier les stratégies de localisation d'éléments
from selenium.webdriver.chrome.service import Service  # Pour configurer le service ChromeDriver
from selenium.webdriver.chrome.options import Options  # Pour définir des options pour Chrome
from selenium.webdriver.support.ui import WebDriverWait  # Pour gérer les attentes dans Selenium
from bs4 import BeautifulSoup  # Pour analyser les documents HTML et XML
import time  # Pour gérer les délais et les horodatages
from datetime import datetime, timezone  # Pour gérer le temps et les horodatages
from urllib.request import urlopen, Request  # Pour effectuer des requêtes HTTP
import pandas as pd  # Pour la manipulation et le stockage des données

# Horodatage actuel
now = datetime.now(timezone.utc)  # Capture l'heure actuelle en UTC
posts = []  # Liste pour stocker les données collectées sur les posts

# Fonction pour sauvegarder périodiquement les données collectées dans un fichier CSV
def periodicSave(data):
    data = pd.DataFrame(data)  # Convertit les données en DataFrame
    data.to_csv("./redditPosts.csv")  # Sauvegarde sous forme de fichier CSV
    print("Sauvegarde périodique effectuée, Total des posts sauvegardés :", len(data))  # Log de confirmation

# Fonction pour collecter des posts des subreddits liés à la santé
def collectSubRedditsPosts(url):
    time.sleep(5)  # Pause de 5 secondes pour éviter de surcharger le serveur
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'}
    request = Request(url, headers=headers)  # Création de la requête HTTP avec un User-Agent
    
    # Ouvre l'URL et lit le contenu de la page
    with urlopen(request) as response:
        page_source = response.read()  # Lecture du contenu de la page
    print(f"Navigation vers {url}")  # Log de navigation

    # Analyse de la source de la page avec BeautifulSoup
    soup = BeautifulSoup(page_source, 'html.parser')

    # Recherche et traitement des posts
    posts_elements = soup.find_all("div", class_="thing")  # Trouve tous les posts
    for post_element in posts_elements:
        try:
            post = {  # Récupération des détails de chaque post
                "authorName": post_element.get("data-author"),  # Nom de l'auteur
                "authorId": post_element.get("data-author-fullname"),  # ID de l'auteur
                "commentCount": post_element.get("data-comments-count"),  # Nombre de commentaires
                "commentsLink": post_element.get("data-url"),  # Lien vers les commentaires
                "createdAt": post_element.get("data-timestamp"),  # Horodatage du post
                "postId": post_element.get("id"),  # ID du post
                "postTitle": post_element.find("a", class_="title may-blank").get_text(),  # Titre du post
                "subredditName": post_element.get("data-subreddit-prefixed"),  # Nom du subreddit
                "collectedAt": now.strftime("%Y-%m-%dT%H:%M:%S.") + str(now.microsecond).ljust(6, '0') + "+0000",  # Horodatage de la collecte
                "interactionCategory": post_element.find("span", class_="flairrichtext").get_text("title")
                if post_element.find("span", class_="flairrichtext") else "N/A",  # Catégorie d'interaction
            }
            print(post)  # Affiche le post collecté
            posts.append(post)  # Ajoute le post à la liste
            if len(posts) % 500 == 0:  # Sauvegarde périodique tous les 500 posts
                periodicSave(posts)
        except:
            continue  # Ignore les erreurs et passe au post suivant
    try:
        nextButton = soup.find("span", class_="next-button").find("a").get("href")  # Recherche du bouton "Suivant"
        if nextButton:
            collectSubRedditsPosts(nextButton)  # Navigation vers la page suivante
    except:
        print("Aucun bouton 'Suivant' trouvé")  # Log si aucune page suivante
        return posts  # Retourne les posts collectés


In [10]:
# Importation de pandas pour manipuler les données et time pour gérer les délais
import pandas as pd
import time

# Chargement du fichier CSV contenant la liste des communautés Reddit sur la santé
topicsList = pd.read_csv("./healthRedditCommunities.csv")  # Chemin vers le fichier CSV

def findCommunity(communityName):
    # Iterate through the topicsList to find the matching topic
    for index in range(len(topicsList)):
        
        if topicsList.loc[index,"topicName"] == communityName:
            return topicsList.loc[index]
    # If no match is found, return a message
    return "Unable to find. Please choose another one."


In [13]:
def scrapCommunity(communityName):
        collectedPosts=[]
        topic = findCommunity(communityName) 
        print(topic)
        topicName = topic["topicName"]  # Nom du sujet
        baseUrl = topic["topicUrl"]  # URL de base du sujet

        # Liste des différentes URLs à explorer pour collecter les posts
        extendedUrls = [
            baseUrl,
            baseUrl + "/new/",  # Nouveaux posts
            baseUrl + "/rising/",  # Posts en hausse
            baseUrl + "/controversial/",  # Posts controversés
            baseUrl + "/controversial/?sort=controversial&t=all",  # Controversés (tous)
            baseUrl + "/controversial/?sort=controversial&t=month",  # Controversés (mois)
            baseUrl + "/controversial/?sort=controversial&t=year",  # Controversés (année)
            baseUrl + "/controversial/?sort=controversial&t=week",  # Controversés (semaine)
            baseUrl + "/controversial/?sort=controversial&t=hour",  # Controversés (heure)
            baseUrl + "/top/",  # Posts les plus populaires
            baseUrl + "/top/?sort=controversial&t=all",  # Populaires controversés (tous)
            baseUrl + "/top/?sort=controversial&t=month",  # Populaires controversés (mois)
            baseUrl + "/top/?sort=controversial&t=year",  # Populaires controversés (année)
            baseUrl + "/top/?sort=controversial&t=week",  # Populaires controversés (semaine)
            baseUrl + "/top/?sort=controversial&t=hour"  # Populaires controversés (heure)
        ]

        # Parcours de chaque URL étendue pour collecter les posts
        for topicUrl in extendedUrls:
            # Appel de la fonction `collectSubRedditsPosts` pour chaque URL
            posts = collectSubRedditsPosts(topicUrl)
            collectedPosts.extend(posts)
        return collectedPosts
        


In [15]:
scrapCommunity("Anxiety")

topicName                             Anxiety
topicUrl     https://old.reddit.com/r/anxiety
Name: 3, dtype: object
Navigation vers https://old.reddit.com/r/anxiety
{'authorName': 'Pi25', 'authorId': 't2_jz8fo', 'commentCount': '7', 'commentsLink': '/r/Anxiety/comments/1gd9nuu/elections_and_politics/', 'createdAt': '1730031476000', 'postId': 'thing_t3_1gd9nuu', 'postTitle': 'Elections and Politics', 'subredditName': 'r/Anxiety', 'collectedAt': '2024-12-02T16:06:25.771722+0000', 'interactionCategory': 'N/A'}
{'authorName': 'AutoModerator', 'authorId': 't2_6l4z3', 'commentCount': '5', 'commentsLink': '/r/Anxiety/comments/1gx5a4f/monthly_checkin_thread/', 'createdAt': '1732273235000', 'postId': 'thing_t3_1gx5a4f', 'postTitle': 'Monthly Check-In Thread', 'subredditName': 'r/Anxiety', 'collectedAt': '2024-12-02T16:06:25.771722+0000', 'interactionCategory': 'N/A'}
{'authorName': 'Sensitive_Lock8059', 'authorId': 't2_g4yyu0yj', 'commentCount': '17', 'commentsLink': '/r/Anxiety/comments/1h4ufzy

TypeError: 'NoneType' object is not iterable