<a href="https://colab.research.google.com/github/gtakhil95/Akhil_INFO5731_Fall2024/blob/main/Gundampalli_Akhil_Exercise_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind, what kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

The research question is, "Do social media trends on road accidents occur annually in the US?"

1.  Identify Key Data Sources

    We should focus on gathering data from platforms where people commonly discuss road accidents:

    *  Twitter: Posts (tweets) related to road accidents.
    *  Facebook: Public posts and groups focused on traffic safety or road incidents.
    *  Reddit: Subreddits related to accidents, news, and public safety.
    *  Instagram: Posts tagged with keywords related to road accidents.
    *  News websites that post on social media: These may also have road accident reports that get shared widely.

2.  Define Relevant Timeframe

    To capture annual trends, we need data covering at least 3 to 5 years so that recurring patterns can be observed. Ideally the data covers every season of each year, since road accidents may follow seasonal patterns.
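To make the idea of a recurring annual pattern concrete, here is a minimal sketch that buckets post timestamps by year and month so the same month can be compared across years. It assumes each post carries a Unix timestamp in seconds (UTC); the function names and the sample timestamps are hypothetical and only for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def monthly_counts(timestamps):
    """Count posts per (year, month) from Unix timestamps in seconds (UTC)."""
    counts = Counter()
    for ts in timestamps:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        counts[(dt.year, dt.month)] += 1
    return counts


def same_month_across_years(counts, month):
    """Return {year: count} for one calendar month, for year-over-year comparison."""
    return {year: n for (year, m), n in sorted(counts.items()) if m == month}


# Hypothetical timestamps standing in for collected posts
sample = [
    datetime(2021, 12, 24, tzinfo=timezone.utc).timestamp(),
    datetime(2021, 12, 31, tzinfo=timezone.utc).timestamp(),
    datetime(2022, 12, 25, tzinfo=timezone.utc).timestamp(),
    datetime(2022, 7, 4, tzinfo=timezone.utc).timestamp(),
]

counts = monthly_counts(sample)
december = same_month_across_years(counts, 12)
print(december)
```

With several years of data, a month that is consistently high (or low) relative to the others would suggest a seasonal pattern worth testing more formally.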

3.  Steps for Data Collection

Step 1:  Develop a Data Collection Plan

*  Choose platforms: Identify which platforms are most likely to have relevant discussions. We can prioritize Reddit because its API is openly accessible.
*  Select tools: Use tools such as:
    *  Social media APIs (Twitter API, Reddit API).
    *  Web scraping tools (BeautifulSoup for scraping Instagram posts, or Scrapy for forums).
    *  Sentiment analysis tools for processing text (VADER or TextBlob).

Step 2:  Keyword Identification and Refinement

*  Compile a list of keywords and hashtags commonly associated with road accidents.
*  Use a test run to gather a small dataset and refine the keyword list if necessary (remove irrelevant or vague terms).
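One possible way to carry out the refinement in Step 2 is sketched here. It assumes the small test dataset has been labeled by hand as relevant or not relevant, and it keeps only keywords whose matches are mostly relevant. The pilot posts, the function names, and the 0.6 threshold are all hypothetical choices for illustration.

```python
def keyword_precision(labeled_posts, keywords):
    """For each keyword, return (number of matches, share of matches labeled relevant)."""
    stats = {}
    for kw in keywords:
        matched = [relevant for text, relevant in labeled_posts
                   if kw.lower() in text.lower()]
        precision = sum(matched) / len(matched) if matched else 0.0
        stats[kw] = (len(matched), precision)
    return stats


def refine_keywords(stats, min_precision=0.6):
    """Keep keywords that matched at least once and are mostly relevant."""
    return [kw for kw, (n, p) in stats.items() if n > 0 and p >= min_precision]


# Hypothetical hand-labeled pilot sample: (post text, is it about a road accident?)
pilot = [
    ("Car crash on I-35 this morning, two lanes closed", True),
    ("Huge collision between two galaxies observed", False),
    ("Stock market crash fears grow", False),
    ("Multi-car collision on the highway near Dallas", True),
    ("Particle collision data released", False),
]

keywords = ["car crash", "collision", "crash"]
stats = keyword_precision(pilot, keywords)
kept = refine_keywords(stats)
print(stats)
print(kept)
```

In this toy sample the broad terms "collision" and "crash" pull in unrelated posts, so only the more specific "car crash" survives the cut.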

Step 3:  Data Collection

*  API/Tool setup: Create accounts and set up access to social media platforms’ APIs.
*  Gather historical data: If possible, retrieve historical data from social media platforms using keywords or hashtags.
*  Ensure US focus: Filter posts or mentions to include only those from the US, for example by using geotagged data or by matching mentions of US cities or states.
*  Frequency: Set the collection to grab data at regular intervals (e.g., daily, weekly) for continuous monitoring over several years.
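The US-focus filter from Step 3 could look roughly like the sketch below when no geotag is available. It matches post titles against a list of US place names; the list here is deliberately short and hypothetical, and a real filter would need all states and many more cities, plus handling for ambiguous names.

```python
import re

# Illustrative, incomplete list of US place names (an assumption, not a full gazetteer)
US_PLACES = ["Texas", "California", "Florida", "New York", "Dallas", "Chicago"]

# Whole-word, case-insensitive match on any of the place names
US_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(place) for place in US_PLACES) + r")\b",
    re.IGNORECASE,
)


def mentions_us_place(text):
    """Return True if the text mentions one of the listed US places."""
    return US_PATTERN.search(text) is not None


# Hypothetical posts
posts = [
    {"id": "a1", "title": "Pile-up on I-10 in Texas"},
    {"id": "b2", "title": "Lorry overturns on the M25"},
    {"id": "c3", "title": "Hit and run in downtown Chicago"},
]

us_posts = [post for post in posts if mentions_us_place(post["title"])]
print([post["id"] for post in us_posts])
```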

Step 4:  Data Storage and Organization

*  Store data in a structured format: Save the collected data in formats such as CSV or JSON for easy analysis.
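As a minimal sketch of Step 4, the same list of records can be written to both CSV and JSON using only the standard library. The record fields, file names, and the `save_records` helper are hypothetical; the sketch assumes every record has the same keys.

```python
import csv
import json
import os
import tempfile


def save_records(records, csv_path, json_path):
    """Write a non-empty list of same-keyed dicts to a CSV file and a JSON file."""
    if not records:
        raise ValueError("no records to save")
    fieldnames = list(records[0].keys())
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


# Hypothetical records standing in for collected posts
records = [
    {"id": "a1", "title": "Pile-up on I-10", "score": 12},
    {"id": "b2", "title": "Hit and run downtown", "score": 7},
]

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "posts.csv")
json_path = os.path.join(out_dir, "posts.json")
save_records(records, csv_path, json_path)
print(sorted(os.listdir(out_dir)))
```

CSV is convenient for spreadsheets and pandas, while JSON keeps value types (numbers stay numbers) and handles nested fields if they are added later.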

4.  Amount of Data Needed

    *  Sample size: Aim for at least 10,000 to 100,000 social media posts per year, depending on the platform, to have a sufficient sample size for analysis.
    *  Length of study: Collect data from 3 to 5 years to detect annual trends.


## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [19]:
pip install praw




In [20]:
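import os

import praw
import pandas as pd
import warnings

warnings.filterwarnings("ignore")

# Step 1: Set up Reddit API connection using PRAW.
# Credentials are read from environment variables instead of being hardcoded in the notebook;
# REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are assumed variable names, set them before running.
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     user_agent='Big_Jaguar_8911')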

# Step 2: Specify subreddits and keywords
subreddits = ['carcrash', 'roadcam', 'cars', 'publicfreakout', 'news', 'AskReddit', 'idiotsincars']
keywords = ['road accident', 'car crash', 'traffic accident', 'collision', 'hit and run']

# Step 3: Function to search Reddit for posts
def search_reddit(subreddit, keyword, limit=200):
    posts = []
    subreddit_instance = reddit.subreddit(subreddit)

    # Search posts within the subreddit using the keyword
    for submission in subreddit_instance.search(keyword, limit=limit):
        posts.append({
            'title': submission.title,
            'score': submission.score,
            'id': submission.id,
            'url': submission.url,
            'num_comments': submission.num_comments,
            'created': submission.created_utc,
            'body': submission.selftext,
            'subreddit': submission.subreddit.display_name
        })

    return posts

# Step 4: Collect data from multiple subreddits using keywords
all_posts = []
for subreddit in subreddits:
    for keyword in keywords:
        posts = search_reddit(subreddit, keyword, limit=200)  # Fetch posts with the limit
        all_posts.extend(posts)  # Add posts to the main dataset
        if len(all_posts) >= 1000:  # Stop collecting once we have 1000 posts
            break
    if len(all_posts) >= 1000:
        break

# Step 5: Convert the data into a pandas DataFrame
df = pd.DataFrame(all_posts)

# Step 6: Save the data into a CSV file for analysis
df.to_csv('reddit_road_accidents_dataset.csv', index=False)

print(f"Collected {len(df)} posts related to road accidents.")


It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

Collected 1068 posts related to road accidents.
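The run above returned 1068 posts rather than 1000 because the size check happens only after each full search, and the same submission can also match more than one keyword. A small post-processing step, sketched here under the assumption that each post is a dict with an `id` field as in the code above, could drop duplicates and cap the dataset at exactly 1000. The helper name and the sample data are hypothetical.

```python
def dedupe_and_cap(posts, key="id", cap=1000):
    """Keep the first occurrence of each post id, stopping once `cap` posts are kept."""
    seen = set()
    unique = []
    for post in posts:
        if len(unique) >= cap:
            break
        if post[key] in seen:
            continue
        seen.add(post[key])
        unique.append(post)
    return unique


# Hypothetical data: 1068 rows where every 10th row repeats an earlier id
raw_posts = [{"id": f"p{i if i % 10 else 0}", "title": f"Post {i}"} for i in range(1068)]

clean_posts = dedupe_and_cap(raw_posts, cap=1000)
print(len(raw_posts), len(clean_posts))
```

If fewer than 1000 unique posts remain after deduplication, more subreddits or keywords would be needed to reach the target.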


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [22]:
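import csv
import os
import requests

# SerpAPI setup
# The key is read from an environment variable instead of being hardcoded;
# SERPAPI_API_KEY is an assumed variable name, set it before running.
API_KEY = os.environ["SERPAPI_API_KEY"]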
SEARCH_ENGINE = "google_scholar"

def search_google_scholar(query, num_results=1000):
    """Searches Google Scholar using SerpAPI and returns the articles."""
    params = {
        "engine": SEARCH_ENGINE,
        "q": query,
        "api_key": API_KEY,
        "num": num_results,  # Number of results to fetch
        "as_ylo": "2014",    # Start year of publication
        "as_yhi": "2024",    # End year of publication
    }

    response = requests.get("https://serpapi.com/search", params=params)
    data = response.json()
    return data.get("organic_results", [])

def save_articles_to_csv(articles, filename="articles.csv"):
    """Saves article information to a CSV file."""
    # Define CSV headers
    headers = ["Title", "Venue/Journal/Conference", "Year", "Authors", "Abstract"]

    # Open CSV file for writing
    with open(filename, mode="w", newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(headers)  # Write header row

        # Write article rows
        for article in articles:
            title = article.get("title", "N/A")
            venue = article.get("publication_info", {}).get("venue", "N/A")
            year = article.get("publication_info", {}).get("year", "N/A")
            authors = ", ".join(article.get("authors", []))
            abstract = article.get("snippet", "N/A")

            writer.writerow([title, venue, year, authors, abstract])

def main():
    # Perform search on Google Scholar for keyword 'XYZ'
    query = "XYZ"
    articles = search_google_scholar(query)

    # Save the results to CSV
    save_articles_to_csv(articles)
    print(f"Saved {len(articles)} articles to 'articles.csv'.")

if __name__ == "__main__":
    main()


Saved 0 articles to 'articles.csv'.
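The run above saved no articles. Two likely contributors, both stated here as assumptions to check against the SerpAPI documentation, are that the Google Scholar engine returns results in small pages (so a single request with `num=1000` cannot return 1000 items and results must be paged with a `start` offset), and that authors, venue, and year arrive together in one `publication_info` summary string rather than as separate `venue` and `year` fields. The sketch below shows the paging loop and the summary parsing in isolation. `fetch_page` is a hypothetical callable standing in for the real request, and the summary layout `authors - venue, year - source` is assumed.

```python
import re


def collect_paginated(fetch_page, total=1000, page_size=20):
    """Call fetch_page(start, page_size) until `total` items are gathered or a page is empty."""
    results = []
    start = 0
    while len(results) < total:
        page = fetch_page(start, page_size)
        if not page:
            break
        results.extend(page)
        start += page_size
    return results[:total]


def parse_summary(summary):
    """Split an assumed 'authors - venue, year - source' string into its parts."""
    parts = [part.strip() for part in summary.split(" - ")]
    authors = parts[0] or "N/A"
    venue, year = "N/A", "N/A"
    if len(parts) >= 2:
        match = re.search(r"\b(19|20)\d{2}\b", parts[1])
        if match:
            year = match.group(0)
        # Whatever remains after removing the year is treated as the venue
        venue = re.sub(r",?\s*\b(19|20)\d{2}\b", "", parts[1]).strip() or "N/A"
    return {"authors": authors, "venue": venue, "year": year}


# Hypothetical stand-in for the API: 45 fake results served in pages
def fake_fetch(start, page_size):
    data = [{"title": f"Paper {i}"} for i in range(45)]
    return data[start:start + page_size]


articles = collect_paginated(fake_fetch, total=1000, page_size=20)
info = parse_summary("A Smith, B Jones - Journal of Safety Research, 2019 - Elsevier")
print(len(articles), info)
```

Under these assumptions, 1000 articles at 20 per page would take 50 requests, which matters when the plan allows only a limited number of searches per month.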


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
pip install praw




In [None]:
import os

import praw
import pandas as pd

# Step 1: Set up Reddit API connection using PRAW.
# Credentials are read from environment variables instead of being hardcoded;
# REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are assumed variable names, set them before running.
reddit = praw.Reddit(
    client_id=os.environ['REDDIT_CLIENT_ID'],
    client_secret=os.environ['REDDIT_CLIENT_SECRET'],
    user_agent='Big_Jaguar_8911'  # Replace with your user agent
)

# Step 2: Function to search Reddit posts using keywords
def search_reddit(keyword, limit=100):
    posts = []

    # Search Reddit for submissions containing the keyword
    for submission in reddit.subreddit("all").search(keyword, limit=limit):
        posts.append({
            'title': submission.title,
            'username': submission.author.name if submission.author else 'N/A',
            'post_id': submission.id,
            'url': submission.url,
            'num_comments': submission.num_comments,
            'score': submission.score,
            'created': submission.created_utc,
            'body': submission.selftext,
            'subreddit': submission.subreddit.display_name
        })

    return posts

# Step 3: Define search terms and collect data
keywords = ['road accident', 'car crash', 'traffic accident', 'collision']
data = []

# Collect data for each keyword
for keyword in keywords:
    keyword_data = search_reddit(keyword, limit=100)  # Fetch 100 posts per keyword
    data.extend(keyword_data)

# Step 4: Convert the collected data to a pandas DataFrame
df = pd.DataFrame(data)

# Select and order the columns; the dataset has nine columns, more than the required four
df = df[['title', 'username', 'post_id', 'url', 'num_comments', 'score', 'subreddit', 'created', 'body']]

# Step 5: Save the collected data into a CSV file for further analysis
df.to_csv('reddit_posts_data.csv', index=False)

print(f"Collected {len(df)} posts from Reddit related to the keywords: {', '.join(keywords)}")


It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

Collected 400 posts from Reddit related to the keywords: road accident, car crash, traffic accident, collision
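The `created` column above holds Unix timestamps, which are hard to read when inspecting the CSV. A minimal sketch of adding a human-readable UTC column is shown here; it assumes each post is a dict with a `created` value in seconds, and the helper name, the new `created_iso` column, and the sample post are hypothetical.

```python
from datetime import datetime, timezone


def add_created_iso(posts, key="created"):
    """Return copies of the posts with an extra ISO 8601 UTC timestamp column."""
    return [
        {**post, "created_iso": datetime.fromtimestamp(post[key], tz=timezone.utc).isoformat()}
        for post in posts
    ]


# Hypothetical post with a known creation time
sample_posts = [
    {
        "post_id": "abc123",
        "title": "Car crash on the interstate",
        "created": datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc).timestamp(),
    }
]

readable_posts = add_created_iso(sample_posts)
print(readable_posts[0]["created_iso"])
```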


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

Reflective Feedback on Web Scraping and Data Collection

Learning Experience:
Working on web scraping tasks helps in understanding the various ways to gather data from online sources, enhancing skills in Python programming and data extraction techniques. Key concepts like handling APIs, parsing HTML and managing data storage are crucial.

Challenges Encountered:
Challenges include handling dynamic web content, dealing with rate limits, and managing data consistency. For Question 3, I could not obtain an API key for any of the listed websites except Google Scholar (through SerpAPI), and that key is restricted to 100 searches per month. As a result, I could not collect the required articles, and the output file is empty.

Relevance to Your Field of Study:
Mastering data collection from online sources can greatly enhance research capabilities, providing valuable insights and supporting data-driven decision-making in various fields.