<a href="https://colab.research.google.com/github/madhan444-s/Madhan_INFO5731_Spring2024/blob/main/Dadi_Madhan_Exercise_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
# write your answer here
'''
Research Question: Does exposure to personalized news feeds on social media lead to less common ground and more polarized discussions during offline social interactions?

Data needed:
1. Individual data:
a. Demographics (age, gender, location)
b. Social media usage patterns (time spent, platform preference)
c. News consumption preferences (topics, sources)
d. Personality traits (openness to experience, agreeableness)


2. Social interaction data:
a. Recordings/transcripts of conversations during offline social gatherings
b. Pre- and post-interaction surveys covering:
   - Topics discussed
   - Level of agreement/disagreement
   - Perception of common ground
   - Individual emotional state


Amount of data:
Ideally, data from hundreds of individuals across diverse demographics and social media usage patterns would be needed.
For each individual, multiple recordings/transcripts of offline interactions in different settings (family, friends, colleagues) would be valuable.
Pre- and post-interaction surveys should be conducted for each interaction.

Data collection steps:
1. Recruit participants: Advertise the study on social media and relevant online communities, targeting diverse demographics.
2. Collect individual data: Use online surveys to gather demographic information, social media usage patterns, and personality assessments.
3. Track news consumption: Utilize browser extensions or dedicated apps to track websites and articles participants access.
4. Record/transcribe offline interactions: Participants wear recording devices or take detailed notes during social gatherings, later anonymized and transcribed.
5. Administer pre- and post-interaction surveys: Participants answer questions about topics discussed, agreement/disagreement, perceived common ground, and emotional state before and after each interaction.

Data saving and security:
Store all data securely and anonymously, following ethical guidelines and data privacy regulations.
Use encrypted databases and password protection.
Obtain informed consent from participants regarding data collection and usage.
'''

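The "store all data securely and anonymously" step above could begin by replacing participant identifiers with keyed hashes before anything is written to disk. The following is a minimal sketch under stated assumptions, not a complete privacy solution: the helper name `pseudonymize_id`, the sample e-mail address, and the way the secret key is created are all illustrative.

```python
import hashlib
import hmac
import secrets


def pseudonymize_id(raw_id: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for raw_id using HMAC-SHA256 (hypothetical helper)."""
    digest = hmac.new(secret_key, raw_id.encode("utf-8"), hashlib.sha256)
    # Keep the first 16 hex characters as a compact identifier
    return digest.hexdigest()[:16]


# Assumption: in a real study the key would be stored separately from the data,
# not regenerated on every run as it is here.
study_key = secrets.token_bytes(32)

alias_a = pseudonymize_id("participant@example.com", study_key)
alias_b = pseudonymize_id("participant@example.com", study_key)
print(alias_a == alias_b)  # the same person always maps to the same pseudonym
```

Using a keyed hash rather than a plain hash means that someone who sees only the stored pseudonyms cannot confirm a guessed identifier without also holding the key.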

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [2]:
# write your answer here
import pandas as pd
import random
# Generate synthetic data for 1000 participants
participants = []
for participant_id in range(1, 1001):
    participant = {
        'ID': participant_id,
        'Age': random.randint(18, 65),
        'Gender': random.choice(['Male', 'Female', 'Other']),
        'Location': random.choice(['Urban', 'Suburban', 'Rural']),
        'SocialMediaUsage': random.randint(1, 10),  # hours per day
        'PlatformPreference': random.choice(['Facebook', 'Twitter', 'Instagram']),
        'NewsTopics': random.choice(['Politics', 'Technology', 'Health', 'Sports']),
        'NewsSources': random.choice(['CNN', 'BBC', 'NY Times', 'BuzzFeed']),
        'OpennessToExperience': random.uniform(1, 5),
        'Agreeableness': random.uniform(1, 5),
    }
    participants.append(participant)

# Convert data to DataFrame
df_individual = pd.DataFrame(participants)

# Generate synthetic data for social interactions
interactions = []
for participant_id in range(1, 1001):
    for _ in range(random.randint(2, 5)):  # random number of interactions per participant
        interaction = {
            'ID': participant_id,
            'Setting': random.choice(['Family', 'Friends', 'Colleagues']),
            'TopicsDiscussed': random.choice(['Politics', 'Technology', 'Movies', 'Food']),
            'LevelOfAgreement': random.choice(['High', 'Medium', 'Low']),
            'PerceivedCommonGround': random.choice(['Yes', 'No']),
            'EmotionalStatePre': random.choice(['Happy', 'Neutral', 'Angry']),
            'EmotionalStatePost': random.choice(['Happy', 'Neutral', 'Angry']),
        }
        interactions.append(interaction)

# Convert data to DataFrame
df_interactions = pd.DataFrame(interactions)

# Save data to CSV files
df_individual.to_csv('individual_data.csv', index=False)
df_interactions.to_csv('interaction_data.csv', index=False)

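Because the dataset above is generated with `random`, every run produces different values. One possible way to make the synthetic sample reproducible is to draw from a seeded `random.Random` instance. This is a minimal sketch: `make_participant`, `make_sample`, and the seed value 5731 are illustrative assumptions, and only a few of the fields are shown.

```python
import random


def make_participant(participant_id, rng):
    """Build one synthetic participant record from the given random generator."""
    return {
        'ID': participant_id,
        'Age': rng.randint(18, 65),
        'Gender': rng.choice(['Male', 'Female', 'Other']),
        'SocialMediaUsage': rng.randint(1, 10),  # hours per day
    }


def make_sample(n, seed):
    """Generate n records; the same seed always yields the same records."""
    rng = random.Random(seed)
    return [make_participant(i, rng) for i in range(1, n + 1)]


first_run = make_sample(1000, seed=5731)
second_run = make_sample(1000, seed=5731)
print(len(first_run), first_run == second_run)
```

Passing the generator into the helper, instead of calling the module-level `random` functions, keeps the seed local to this dataset and leaves other random draws in the notebook unaffected.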

## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Library (https://dl.acm.org/) with the keyword "XYZ". The articles should have been published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue (journal or conference) where the article was published

(3) Year

(4) Authors

(5) Abstract

In [63]:
# write your answer here
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup


def getGoogleScholarArticles(keyword, maxResults=1000):
    baseUrl = 'https://scholar.google.com/scholar'
    params = {
        'q': keyword,
        'hl': 'en',
        'as_sdt': '0,5',
    }

    articles = []
    count = 0
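    while count < maxResults:
        # Set a timeout and stop if the server does not return a normal page
        # (automated requests may be refused or rate-limited)
        response = requests.get(baseUrl, params=params, timeout=30)
        if response.status_code != 200:
            break
        soup = BeautifulSoup(response.text, 'html.parser')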

        for result in soup.find_all('div', class_='gs_ri'):
            title = result.find('h3', class_='gs_rt').text
            # The gs_a line typically reads "authors - venue, year - publisher"
            meta = result.find('div', class_='gs_a').text
            metaParts = re.split(r'\s+-\s+', meta)
            authors = metaParts[0].strip()
            venue = metaParts[1].strip() if len(metaParts) > 1 else 'N/A'
            # Take the first four-digit year (19xx or 20xx) found in the metadata line
            yearMatch = re.search(r'\b(?:19|20)\d{2}\b', meta)
            year = yearMatch.group(0) if yearMatch else 'N/A'

            abstract = result.find('div', class_='gs_rs')
            abstract = abstract.text if abstract else 'N/A'

            articles.append({
                'Title': title,
                'Venue': venue,
                'Authors': authors,
                'Year': year,
                'Abstract': abstract
            })

            count += 1
            if count >= maxResults:
                break

        # Stop when the page yields no results; otherwise request the next page
        if not soup.find_all('div', class_='gs_ri'):
            break
        params['start'] = count

    return articles

def filterByDate(articles, start_year, end_year):
    """Keep only the articles whose year lies within [start_year, end_year]."""
    filteredArticles = []
    for article in articles:
        try:
            year = int(article['Year'])
        except ValueError:
            # Skip articles whose year could not be parsed (e.g., 'N/A')
            continue
        if start_year <= year <= end_year:
            filteredArticles.append(article)
    return filteredArticles

def main():
    keyword = "XYZ"
    maxResults = 1000
    start_year = 2014
    end_year = 2024

    articles = getGoogleScholarArticles(keyword, maxResults)
    filteredArticles = filterByDate(articles, start_year, end_year)

    df = pd.DataFrame(filteredArticles)
    df.to_csv('articles_data.csv', index=False)

if __name__ == "__main__":
    main()

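Paging through search results, as the loop above does with the `start` parameter, can be made explicit by generating one request URL per page. The sketch below only builds the URLs and sends no requests. It assumes 10 results per page and uses the `as_ylo` and `as_yhi` query parameters to restrict the year range; `build_page_urls` is a hypothetical helper, and the parameter behavior should be checked against the live site before relying on it.

```python
from urllib.parse import urlencode

BASE_URL = 'https://scholar.google.com/scholar'


def build_page_urls(keyword, max_results, start_year, end_year, page_size=10):
    """Return one search URL per results page (assumes page_size results per page)."""
    urls = []
    for start in range(0, max_results, page_size):
        query = urlencode({
            'q': keyword,
            'hl': 'en',
            'start': start,        # offset of the first result on this page
            'as_ylo': start_year,  # earliest publication year
            'as_yhi': end_year,    # latest publication year
        })
        urls.append(f'{BASE_URL}?{query}')
    return urls


page_urls = build_page_urls('XYZ', max_results=1000, start_year=2014, end_year=2024)
print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```

Filtering by year in the query, where the site supports it, avoids downloading results that the date filter would discard afterwards. Automated clients are often rate-limited, so a pause between requests is advisable when the URLs are actually fetched.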

## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [65]:

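import os

import praw
import pandas as pd

# Set up your Reddit API credentials.
# Assumption: the credentials are supplied through environment variables
# (the variable names below are illustrative) rather than hard-coded in the notebook.
redditClientId = os.environ['REDDIT_CLIENT_ID']
redditClientSecret = os.environ['REDDIT_CLIENT_SECRET']
redditUserAgent = 'MyRedditApp/1.0 by Ok_Abbreviations8589'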

# Authenticate with Reddit API
reddit = praw.Reddit(
    client_id=redditClientId,
    client_secret=redditClientSecret,
    user_agent=redditUserAgent
)

def collect_subreddit_posts(subredditName, numPosts):
    posts = []
    subreddit = reddit.subreddit(subredditName)

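    for submission in subreddit.top(limit=numPosts):
        # submission.author is None when the account has been deleted
        author = submission.author.name if submission.author else '[deleted]'
        posts.append([submission.created_utc, author, submission.title, submission.score, submission.num_comments])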

    return posts

# Example usage
subredditToSearch = 'python'
noOfPosts = 10

collectedPosts = collect_subreddit_posts(subredditToSearch, noOfPosts)

# Create a DataFrame from the collected data
columns = ['Timestamp', 'Author', 'Title', 'Score', 'Num_comments']
df = pd.DataFrame(collectedPosts, columns=columns)

# Display the DataFrame
print(df)


It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



      Timestamp               Author  \
0  1.587424e+09               iEslam   
1  1.594386e+09           Krukerfluk   
2  1.588945e+09      jessjwilliamson   
3  1.513644e+09           backprop88   
4  1.571943e+09  janky_british_gamer   
5  1.599933e+09           paulkaefer   
6  1.589972e+09            Itwist101   
7  1.594632e+09                atqm-   
8  1.589235e+09               Nekose   
9  1.598100e+09           HotTeenBoy   

                                               Title  Score  Num_comments  
0  Lad wrote a Python script to download Alexa vo...  12348           133  
1                                     This post has:   9233           437  
2  I redesign the Python logo to make it more modern   7865           266  
3     Automate the boring stuff with python - tinder   6715           327  
4  Just finished programming and building my own ...   6606           469  
5  I'm excited to share my first published book, ...   6498           249  
6  Drawing Mona Lisa with 2

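The `Timestamp` column above holds `created_utc` values, which are Unix epoch seconds. A minimal sketch of converting them to readable UTC timestamps with the standard library follows; `epoch_to_iso` is a hypothetical helper name, and the second sample value is illustrative because the printed table shows rounded numbers.

```python
from datetime import datetime, timezone


def epoch_to_iso(epoch_seconds):
    """Convert Unix epoch seconds to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


print(epoch_to_iso(0))  # 1970-01-01T00:00:00+00:00
print(epoch_to_iso(1587424000))
```

With the DataFrame from the cell above, the same conversion could be applied to the whole column with `pd.to_datetime(df['Timestamp'], unit='s', utc=True)`.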
## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
Write your response here.
'''