# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
'''
Research Question:
How do daily weather conditions affect mood and productivity in office workers?

Data Needed:
- Weather Data: Daily weather variables such as temperature, humidity, precipitation, wind speed, and cloud cover. This data can be collected from local meteorological stations or weather APIs.
- Mood and Productivity Data: Daily self-reported mood and productivity ratings from office workers. This data can be collected through surveys or mobile applications.

Amount of Data Needed:
- Weather Data: At least one year of daily weather data to capture seasonal variations.
- Mood and Productivity Data: Daily data for the same time period as the weather data, collected from a representative sample of office workers.

Steps for Data Collection and Saving:

Weather Data Collection:
1. Identify local meteorological stations or weather APIs that provide historical daily weather data.
2. Retrieve daily weather data for the desired location, including temperature, humidity, precipitation, wind speed, and cloud cover.
3. Save the weather data in a structured format such as CSV or JSON, with each row representing a single day and columns for each weather variable.

Mood and Productivity Data Collection:
1. Develop a survey or mobile application to collect daily mood and productivity ratings from office workers.
2. Ensure that the survey or application is user-friendly and can easily capture data without causing disruptions to the participants' workflow.
3. Recruit a representative sample of office workers to participate in the study.
4. Collect daily mood and productivity ratings from participants over the same time period as the weather data.
5. Save the mood and productivity data in a structured format such as CSV or database tables, with each row representing a single day and columns for mood rating, productivity rating, and participant ID.

Data Storage and Management:
1. Create a dedicated storage system for the collected data, ensuring it is secure and accessible to researchers.
2. Organize the weather and mood/productivity data into separate datasets or tables.
3. Implement regular backups to prevent data loss.
4. Document the data collection process thoroughly, including any preprocessing steps or data cleaning procedures.

By following these steps, one can collect and save the necessary data to analyze the relationship between daily weather variability and mood/productivity in office workers.

'''
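
The collection steps above produce two separate tables: one row per day of weather, and one row per participant per day of ratings. Before any analysis they have to be joined on the date. The following is a minimal sketch of that join, not part of the original plan. The column names (`date`, `temperature_c`, `participant_id`, `mood`, `productivity`) and the helper name `join_by_date` are hypothetical, and the sample values are made up for illustration.

```python
import csv
import io


def join_by_date(weather_rows, mood_rows):
    """Attach the weather record for each day to every rating from that day."""
    weather_by_date = {row["date"]: row for row in weather_rows}
    joined = []
    for mood in mood_rows:
        weather = weather_by_date.get(mood["date"])
        if weather is None:
            continue  # skip ratings that have no weather record for that day
        joined.append({**weather, **mood})
    return joined


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the two CSV files (values are made up)
    weather_csv = io.StringIO("date,temperature_c\n2024-01-01,12.5\n2024-01-02,9.0\n")
    mood_csv = io.StringIO(
        "date,participant_id,mood,productivity\n"
        "2024-01-01,p1,7,6\n"
        "2024-01-01,p2,5,8\n"
    )
    rows = join_by_date(list(csv.DictReader(weather_csv)), list(csv.DictReader(mood_csv)))
    for row in rows:
        print(row)
```

Keying the weather rows by date keeps the join to a single pass over the ratings, and ratings without a matching weather day are dropped rather than filled with guesses.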

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
import random
import csv

# Generate simulated weather data
def generate_weather_data():
    weather_data = []
    for _ in range(1000):
        temperature = round(random.uniform(10, 30), 2)  # Temperature in Celsius
        humidity = round(random.uniform(30, 80), 2)  # Humidity in percentage
        precipitation = round(random.uniform(0, 10), 2)  # Precipitation in mm
        wind_speed = round(random.uniform(0, 20), 2)  # Wind speed in km/h
        cloud_cover = round(random.uniform(0, 100), 2)  # Cloud cover in percentage
        weather_data.append([temperature, humidity, precipitation, wind_speed, cloud_cover])
    return weather_data

# Generate simulated mood and productivity data
def generate_mood_productivity_data():
    mood_productivity_data = []
    for _ in range(1000):
        mood_rating = random.randint(1, 10)  # Mood rating on a scale of 1 to 10
        productivity_rating = random.randint(1, 10)  # Productivity rating on a scale of 1 to 10
        mood_productivity_data.append([mood_rating, productivity_rating])
    return mood_productivity_data

# Combine weather and mood/productivity data into a single dataset
def combine_data(weather_data, mood_productivity_data):
    # Pair rows by position; zip avoids hard-coding the sample count
    return [weather + mood for weather, mood in zip(weather_data, mood_productivity_data)]

# Save the combined dataset to a CSV file
def save_to_csv(data, filename):
    with open(filename, 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(['Temperature (C)', 'Humidity (%)', 'Precipitation (mm)', 'Wind Speed (km/h)', 'Cloud Cover (%)', 'Mood Rating', 'Productivity Rating'])
        writer.writerows(data)

# Main function to generate and save the dataset
def main():
    weather_data = generate_weather_data()
    mood_productivity_data = generate_mood_productivity_data()
    combined_data = combine_data(weather_data, mood_productivity_data)
    save_to_csv(combined_data, 'office_worker_data.csv')

if __name__ == "__main__":
    main()


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Library (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [None]:
import requests
import csv

def get_acm_articles(keyword, num_articles=1000):
    api_url = "https://dl.acm.org/doi/search"
    headers = {
        "Accept": "application/json"
    }
    params = {
        "startYear": "2014",
        "endYear": "2024",
        "q": f"all:{keyword}",
        "pageSize": num_articles
    }
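    response = requests.get(api_url, params=params, headers=headers, timeout=30)
    if response.status_code != 200:
        print("Error retrieving data from ACM Digital Library")
        return None
    try:
        data = response.json()
    except ValueError:
        # A 200 response can still carry HTML rather than JSON
        print("Error retrieving data from ACM Digital Library")
        return None
    return data.get("items", [])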

def save_to_csv(articles, filename):
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Title', 'Venue', 'Year', 'Authors', 'Abstract'])
        for article in articles:
            title = article.get('title', 'N/A')
            venue = article.get('venue', 'N/A')
            year = article.get('year', 'N/A')
            authors = ', '.join(article.get('authors', ['N/A']))
            abstract = article.get('abstract', 'N/A')
            writer.writerow([title, venue, year, authors, abstract])

def main():
    keyword = "XYZ"
    num_articles = 1000
    articles = get_acm_articles(keyword, num_articles)
    if articles:
        filename = f"{keyword}_articles.csv"
        save_to_csv(articles, filename)
        print(f"{len(articles)} articles with keyword '{keyword}' saved to {filename}")
    else:
        print("No articles found.")

if __name__ == "__main__":
    main()



Error retrieving data from ACM Digital Library
No articles found.
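
The run above returned no data because the request to ACM Digital Library did not succeed. Semantic Scholar, one of the sources the question allows, exposes a public paper-search API, so the same five fields could be collected from it instead. The sketch below is an alternative under stated assumptions, not the method used in the cell above. It assumes the Graph API endpoint `https://api.semanticscholar.org/graph/v1/paper/search` with the `query`, `year`, `fields`, `offset`, and `limit` parameters, a page size of at most 100, and unauthenticated access that may be rate limited. The function names are hypothetical, and the standard library is used instead of `requests` so the sketch has no third-party dependency.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumption: Semantic Scholar Graph API paper-search endpoint
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = "title,venue,year,authors,abstract"


def paper_to_row(paper):
    """Flatten one paper record into [title, venue, year, authors, abstract]."""
    authors = ", ".join(a.get("name", "") for a in paper.get("authors") or [])
    return [
        paper.get("title") or "N/A",
        paper.get("venue") or "N/A",
        paper.get("year") or "N/A",
        authors or "N/A",
        paper.get("abstract") or "N/A",
    ]


def fetch_page(keyword, offset, limit=100):
    """Request one page of search results and return the parsed JSON."""
    query = urllib.parse.urlencode({
        "query": keyword,
        "year": "2014-2024",
        "fields": FIELDS,
        "offset": offset,
        "limit": limit,
    })
    with urllib.request.urlopen(f"{SEARCH_URL}?{query}", timeout=30) as resp:
        return json.load(resp)


def collect_rows(keyword, target=1000, page_size=100, pause=1.0):
    """Page through the results until `target` rows are collected or none remain."""
    rows = []
    while len(rows) < target:
        limit = min(page_size, target - len(rows))
        page = fetch_page(keyword, offset=len(rows), limit=limit)
        batch = page.get("data") or []
        if not batch:
            break  # no more results for this query
        rows.extend(paper_to_row(p) for p in batch)
        time.sleep(pause)  # be polite to the unauthenticated endpoint
    return rows[:target]
```

Calling `collect_rows("XYZ")` would return a list of rows that can be written with `csv.writer` under the same `Title, Venue, Year, Authors, Abstract` header used in the cell above. The parsing is kept in `paper_to_row` so it can be checked on a sample record without any network access.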


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
import tweepy
import csv

# Twitter API credentials
consumer_key = "your_consumer_key"
consumer_secret = "your_consumer_secret"
access_token = "your_access_token"
access_token_secret = "your_access_token_secret"

# Authenticate with Twitter API
auth = tweepy.OAuth1UserHandler(consumer_key, consumer_secret, access_token, access_token_secret)
api = tweepy.API(auth)

def get_tweets_by_hashtag(hashtag, num_tweets=1000):
    tweets = []
    # Tweepy v4 renamed API.search to API.search_tweets
    for tweet in tweepy.Cursor(api.search_tweets, q=f"#{hashtag}", tweet_mode='extended').items(num_tweets):
        tweets.append([tweet.user.screen_name, tweet.created_at, tweet.full_text, tweet.retweet_count, tweet.favorite_count])
    return tweets

def save_to_csv(data, filename):
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Username', 'Date', 'Text', 'Retweets', 'Favorites'])
        writer.writerows(data)

def main():
    hashtag = "your_hashtag"
    num_tweets = 1000
    tweets = get_tweets_by_hashtag(hashtag, num_tweets)
    if tweets:
        filename = f"{hashtag}_tweets.csv"
        save_to_csv(tweets, filename)
        print(f"{len(tweets)} tweets with hashtag '{hashtag}' saved to {filename}")
    else:
        print("No tweets found.")

if __name__ == "__main__":
    main()
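
The Tweepy code above needs valid API credentials before it can run. As an alternative from the same list of platforms, the sketch below collects Reddit posts by keyword and keeps six columns, which satisfies the "more than four columns" requirement. It assumes that Reddit's public JSON listing at `https://www.reddit.com/search.json` is reachable without authentication, accepts `q`, `limit`, and `after`, and returns posts under `data.children[].data`. These are assumptions that should be checked against Reddit's current API terms. The function names and the `User-Agent` string are hypothetical.

```python
import json
import time
import urllib.parse
import urllib.request
from datetime import datetime, timezone

COLUMNS = ["author", "created_utc", "subreddit", "title", "score", "num_comments"]


def post_to_row(post):
    """Flatten one Reddit post record into a row matching COLUMNS."""
    created = datetime.fromtimestamp(post.get("created_utc", 0), tz=timezone.utc)
    return [
        post.get("author", "N/A"),
        created.isoformat(),
        post.get("subreddit", "N/A"),
        post.get("title", ""),
        post.get("score", 0),
        post.get("num_comments", 0),
    ]


def listing_to_rows(listing):
    """Convert one parsed listing page into a list of rows."""
    children = listing.get("data", {}).get("children", [])
    return [post_to_row(child["data"]) for child in children]


def fetch_listing(keyword, after=None, limit=100):
    """Request one page of search results (assumed public endpoint)."""
    params = {"q": keyword, "limit": limit}
    if after:
        params["after"] = after
    url = "https://www.reddit.com/search.json?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(url, headers={"User-Agent": "info5731-exercise-sketch/0.1"})
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)


def collect_posts(keyword, target=1000, pause=2.0):
    """Follow the `after` cursor until `target` rows are collected or pages run out."""
    rows, after = [], None
    while len(rows) < target:
        listing = fetch_listing(keyword, after=after)
        batch = listing_to_rows(listing)
        if not batch:
            break
        rows.extend(batch)
        after = listing.get("data", {}).get("after")
        if not after:
            break  # no further pages
        time.sleep(pause)  # pause between requests to avoid hammering the site
    return rows[:target]
```

The rows returned by `collect_posts("your_keyword")` can be written with `csv.writer` after a header row of `COLUMNS`, in the same way as `save_to_csv` above.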



## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF file) to any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the code cell below.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
All in all, completing the web scraping tasks was a rewarding learning experience. Understanding the fundamentals of web scraping,
such as HTML structure, CSS selectors, and APIs, proved essential for obtaining data efficiently from many online sources.
Learning how to identify and navigate HTML elements, handle pagination, and deal with dynamic content was especially
helpful for extracting data from websites.

I ran into specific difficulties when collecting data from certain websites. Some sites have complex
HTML structures or load content dynamically via JavaScript, which makes it hard to extract the desired data with traditional scraping methods.

The ability to gather and analyze data from online sources is highly relevant across many fields of study, including mine.
More generally, being able to collect and use data from internet sources opens up many opportunities for research, analysis, and
decision-making.
'''