<a href="https://colab.research.google.com/github/pramodgangula19/5731_Spring24/blob/main/Gangula_pramod_Exercise_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or a practical question, or something innovative) you have in mind. What kind of data should be collected to answer it? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [1]:
Research Question: The goal of the study is to determine how the use of electric vehicles (EVs) in a specific urban region affects air quality, and to analyze the resulting health and environmental effects.



Steps for Data Collection:



1. Define Variables: Identify the key variables, such as EV adoption, air quality, health, and environmental data.

2. Sample Selection: Select a specified urban region and specify the timeframe for the analysis.

3. Sources of Information: Data should be obtained from government agencies, healthcare institutions, environmental agencies, automobile manufacturers, charging network providers, and meteorological agencies.

4. Data Cleaning and Integration: Clean and preprocess data, deal with missing values, and combine data from several sources into a cohesive dataset.

5. Data Analysis: Using statistical approaches, investigate the relationship between EV adoption and air quality, as well as the health and environmental consequences.

6. Visualization: Create graphics to effectively present findings.

7. Interpretation: Draw conclusions on the effect of EV adoption on air quality and its implications for health and the environment.

8. Report and Publication: Document findings in a report or scholarly article, highlighting policy recommendations and the implications for urban sustainability.
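
Steps 4 and 5 above can be sketched in a few lines of pandas. This is a minimal illustration rather than the study's actual pipeline: the two tables, their figures, and the choice of linear interpolation for the missing reading are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical yearly tables; the figures are placeholders, not measurements
ev = pd.DataFrame({
    "Year": [2018, 2019, 2020, 2021],
    "Number_of_EVs": [3000, 4500, 6500, 9000],
})
air = pd.DataFrame({
    "Year": [2018, 2019, 2020, 2021],
    "PM2.5_Concentration": [15.0, float("nan"), 13.0, 12.0],
})

# Step 4: fill the missing reading by linear interpolation, then merge on Year
air["PM2.5_Concentration"] = air["PM2.5_Concentration"].interpolate()
combined = ev.merge(air, on="Year", how="inner")

# Step 5: Pearson correlation between EV counts and PM2.5 levels
r = combined["Number_of_EVs"].corr(combined["PM2.5_Concentration"])
print(combined)
print(f"Pearson r = {r:.3f}")
```

A correlation on real data would still need controls for weather, traffic volume, and industrial activity before it says anything about cause.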


Sample Size: Select a sample size depending on statistical power requirements, taking into account multiple urban regions or time periods.
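
One way to turn "statistical power requirements" into a number is the Fisher z approximation for detecting a correlation. The sketch below assumes a two-sided test; the expected effect size, significance level, and power are placeholders to be replaced with values justified for the study.

```python
import math
from statistics import NormalDist


def sample_size_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate number of observations needed to detect a correlation r."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    fisher_z = 0.5 * math.log((1 + r) / (1 - r))   # Fisher transform of r
    return math.ceil(((z_alpha + z_beta) / fisher_z) ** 2 + 3)


# Assumed effect size of 0.3 (a moderate correlation)
n = sample_size_for_correlation(0.3)
print(n)
```

With these assumptions the function asks for 85 observations; a weaker expected correlation raises the requirement quickly.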


Ethical Considerations: Make certain that data collection and analysis follow ethical norms, such as data privacy, informed consent, and coordination with the appropriate authorities.


The goal of this study is to provide insights into the effects of EV adoption on urban air quality and to contribute to informed urban planning.








## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [2]:
# write your answer here
# Import necessary libraries for data collection and analysis
import pandas as pd
import numpy as np

# Simulated data sources (replace with actual data sources)
# For illustration purposes, we'll create synthetic data here.
# In a real scenario, you would access real data from various sources.

# Generate EV adoption data (number of EVs in each year)
years = range(2015, 2022)
ev_adoption_data = {
    "Year": years,
    "Number_of_EVs": [1000, 1500, 2200, 3000, 4500, 6500, 9000]  # Replace with actual data
}

# Generate air quality data (PM2.5 concentrations in µg/m³)
air_quality_data = {
    "Year": years,
    "PM2.5_Concentration": [18, 17, 16, 15, 14, 13, 12]  # Replace with actual data
}

# Generate health data (number of respiratory disease cases)
health_data = {
    "Year": years,
    "Respiratory_Disease_Cases": [1200, 1100, 1000, 950, 900, 850, 800]  # Replace with actual data
}

# Generate environmental data (CO2 emissions in metric tons)
environmental_data = {
    "Year": years,
    "CO2_Emissions": [50000, 49000, 48000, 47000, 46000, 45000, 44000]  # Replace with actual data
}

# Create DataFrames from the generated data
ev_df = pd.DataFrame(ev_adoption_data)
air_quality_df = pd.DataFrame(air_quality_data)
health_df = pd.DataFrame(health_data)
environmental_df = pd.DataFrame(environmental_data)

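# Merge DataFrames based on the "Year" column
merged_data = pd.merge(ev_df, air_quality_df, on="Year")
merged_data = pd.merge(merged_data, health_df, on="Year")
merged_data = pd.merge(merged_data, environmental_df, on="Year")

# Save the combined dataset (the file name is illustrative) and display it
merged_data.to_csv("ev_air_quality_data.csv", index=False)
print(merged_data)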
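
Question 2 asks for 1000 samples, while the yearly tables above hold seven rows each. The following is a minimal sketch of how a synthetic dataset of that size could be generated with NumPy. The district identifiers, value ranges, and the assumed link between EV counts and PM2.5 are invented for illustration and would be replaced by real records.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # fixed seed so the sample is reproducible
n_samples = 1000

# Hypothetical district-level records; all ranges are assumptions
synthetic = pd.DataFrame({
    "District_ID": rng.integers(1, 51, size=n_samples),
    "Year": rng.integers(2015, 2022, size=n_samples),
    "Number_of_EVs": rng.integers(100, 10_000, size=n_samples),
})

# Assumed relationship: PM2.5 falls slightly as EV counts rise, plus noise
synthetic["PM2.5_Concentration"] = (
    18 - 0.0006 * synthetic["Number_of_EVs"] + rng.normal(0, 1, size=n_samples)
).round(2)

print(synthetic.shape)
print(synthetic.head())
```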



## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or the ACM Digital Library (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where it was published

(3) Year

(4) Authors

(5) Abstract

In [3]:

# Your code here (please add comments in the code):

import csv
import re
import time

import requests
from bs4 import BeautifulSoup

# Google Scholar search endpoint (the site root does not return search results)
url = "https://scholar.google.com/scholar"
query = "XYZ"  # Replace with your query
year_from, year_to = 2014, 2024  # Limit results to the last 10 years
num_articles = 1000  # Number of articles to collect
page_size = 10  # Google Scholar lists 10 results per page by default

# Browser-like header; Google Scholar may still block scripted requests
headers = {"User-Agent": "Mozilla/5.0"}

# Initialize variables to track collected articles
collected_articles = 0
start = 0  # Offset of the first result on the current page

# Create a CSV file to store the collected data (closed automatically by "with")
csv_filename = "academic_articles.csv"
with open(csv_filename, mode="w", newline="", encoding="utf-8") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["Title", "Venue", "Year", "Authors", "Abstract"])

    # Loop until the desired number of articles is collected
    while collected_articles < num_articles:
        # Define query parameters for Google Scholar search
        params = {
            "q": query,
            "as_ylo": year_from,  # earliest publication year
            "as_yhi": year_to,    # latest publication year
            "start": start,
        }

        # Send an HTTP GET request to Google Scholar
        response = requests.get(url, params=params, headers=headers, timeout=30)

        # Check if the request was successful
        if response.status_code != 200:
            print("Failed to retrieve search results.")
            break

        # Parse the HTML content and find the search results
        soup = BeautifulSoup(response.text, "html.parser")
        results = soup.find_all("div", class_="gs_ri")

        # Stop when a page has no results (end of results or a block page)
        if not results:
            break

        for result in results:
            title_tag = result.find("h3")
            meta_tag = result.find("div", class_="gs_a")
            snippet_tag = result.find("div", class_="gs_rs")

            title = title_tag.get_text(strip=True) if title_tag else ""
            # The result page shows only a snippet, not the full abstract
            abstract = snippet_tag.get_text(" ", strip=True) if snippet_tag else ""

            # The metadata line usually reads "authors - venue, year - source"
            meta = meta_tag.get_text().replace("\xa0", " ") if meta_tag else ""
            parts = [part.strip() for part in meta.split(" - ")]
            authors = parts[0]
            venue_year = parts[1] if len(parts) > 1 else ""
            year_match = re.search(r"\b(?:19|20)\d{2}\b", venue_year)
            year = year_match.group(0) if year_match else ""
            venue = venue_year.replace(year, "").strip(" ,") if year else venue_year

            # Write the data to the CSV file
            csv_writer.writerow([title, venue, year, authors, abstract])
            collected_articles += 1

            # Check if the desired number of articles is reached
            if collected_articles >= num_articles:
                break

        # Move to the next page and pause between requests
        start += page_size
        time.sleep(2)

print(f"Collected and saved {collected_articles} academic articles.")




Collected and saved 0 academic articles.
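
Google Scholar has no official API and often blocks scripted requests, which is one likely reason a run returns 0 articles. Semantic Scholar, also listed in the question, offers a public search API that returns the five required fields as JSON. The sketch below assumes the Academic Graph paper search endpoint with its `query`, `year`, `fields`, `offset`, and `limit` parameters. The helper names, output file name, and one-second pause are illustrative, and the API caps how many relevance-ranked results it returns, so the last page may fail.

```python
import csv
import time

import requests

API_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = "title,venue,year,authors,abstract"


def paper_to_row(paper):
    """Flatten one API record into [title, venue, year, authors, abstract]."""
    authors = "; ".join(a.get("name", "") for a in paper.get("authors") or [])
    return [
        paper.get("title") or "",
        paper.get("venue") or "",
        paper.get("year") or "",
        authors,
        paper.get("abstract") or "",
    ]


def collect_papers(query, year_range="2014-2024", target=1000, page_size=100):
    """Page through search results until `target` rows are gathered."""
    rows, offset = [], 0
    while len(rows) < target:
        params = {
            "query": query,
            "year": year_range,
            "fields": FIELDS,
            "offset": offset,
            "limit": min(page_size, target - len(rows)),
        }
        response = requests.get(API_URL, params=params, timeout=30)
        if response.status_code != 200:
            print(f"Request failed with status {response.status_code}")
            break
        data = response.json().get("data", [])
        if not data:
            break  # no more results
        rows.extend(paper_to_row(p) for p in data)
        offset += len(data)
        time.sleep(1)  # assumed polite pause between pages
    return rows


def save_rows(rows, filename="semantic_scholar_articles.csv"):
    """Write the collected rows to a CSV file with a header line."""
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Title", "Venue", "Year", "Authors", "Abstract"])
        writer.writerows(rows)
```

Calling `save_rows(collect_papers("XYZ"))` would run the collection. Records with a missing abstract come back as empty strings rather than raising an error.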


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [13]:
import os

import tweepy

# Read the Twitter API credentials from environment variables instead of
# hard-coding them in the notebook (the variable names are illustrative)
consumer_key = os.environ["TWITTER_CONSUMER_KEY"]
consumer_secret = os.environ["TWITTER_CONSUMER_SECRET"]
access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_token_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]

# Authenticate with Twitter
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# Create the API object
api = tweepy.API(auth)

# Make an API request.
# Tweepy 4.0 renamed API.search to API.search_tweets; calling api.search there
# raises: AttributeError: 'API' object has no attribute 'search'
tweets = api.search_tweets(q='python', count=10)

# Process the returned data
for tweet in tweets:
    print(tweet.text)  # Print the text of each tweet
    print('---')
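
The question asks for more than four columns, while the loop above prints only the tweet text. The sketch below shows one way to flatten each tweet into six fields and save them as CSV. It assumes Tweepy v1.1 `Status` objects, which expose `id`, `created_at`, `text`, `retweet_count`, `favorite_count`, and `user.screen_name`. The column names, file name, and stand-in tweet are made up so that the example runs without API access.

```python
import csv
from types import SimpleNamespace

COLUMNS = ["tweet_id", "created_at", "username", "text", "retweets", "likes"]


def tweet_to_row(tweet):
    """Pick six fields from a Status-like object, in COLUMNS order."""
    return [
        tweet.id,
        tweet.created_at,
        tweet.user.screen_name,
        tweet.text,
        tweet.retweet_count,
        tweet.favorite_count,
    ]


def save_tweets(tweets, filename="tweets.csv"):
    """Write one CSV row per tweet under a header line."""
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        writer.writerows(tweet_to_row(t) for t in tweets)


# Stand-in object so the sketch runs offline (values are made up)
sample = SimpleNamespace(
    id=1,
    created_at="2024-02-01 12:00:00",
    text="hello #python",
    retweet_count=2,
    favorite_count=5,
    user=SimpleNamespace(screen_name="example_user"),
)
print(tweet_to_row(sample))
```

With a working API connection, `save_tweets(tweets)` would take the search results directly.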

## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here
# Result exported from ParseHub, using the hashtag #texas:
# https://myunt-my.sharepoint.com/:x:/g/personal/pramodgangula_my_unt_edu/ETTrkCh_zUdMl3vZegwKyWABZhURxjDz6ZdjWPTT-Q9mKA?e=jdmeXu

# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
Write your response here.

Learning Experience:
Overall, working on web scraping tasks provided a valuable learning experience in understanding how to extract data from various online sources programmatically. The key concepts that I found most beneficial include understanding HTML structure and tags, utilizing libraries like BeautifulSoup for parsing HTML content, and handling HTTP requests effectively using libraries such as requests. Learning about CSS selectors and XPath for navigating through HTML elements also proved to be valuable techniques for targeting specific data elements on web pages.

Challenges Encountered:
One of the challenges encountered during web scraping was dealing with dynamic content loaded via JavaScript. In such cases, the initial HTML response may not contain all the desired data, requiring additional techniques like using headless browsers or inspecting network requests to retrieve dynamically loaded content. Additionally, some websites enforce strict anti-scraping measures, such as CAPTCHA challenges or rate limiting, which can hinder data collection efforts. To overcome these challenges, it's important to explore alternative data sources or adjust scraping strategies to comply with website policies.

Relevance to Your Field of Study:
The ability to gather and analyze data from online sources is highly relevant across various fields of study, including but not limited to data science, social sciences, market research, and business intelligence. In my field of study, which focuses on natural language processing and machine learning, web scraping enables the collection of text data from diverse sources such as social media, news articles, and forums. This data can then be used for tasks like sentiment analysis, topic modeling, and training machine learning models. Additionally, web scraping facilitates research by providing access to up-to-date information and large-scale datasets for analysis, thereby enhancing the depth and scope of research outcomes.
'''
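
The rate limiting mentioned in the response is commonly signalled with HTTP status 429, and a scraper can respond by waiting longer before each retry. The sketch below is a minimal illustration under stated assumptions: `fetch` is a hypothetical callable returning a `(status, body)` pair, and the attempt count and base delay are arbitrary choices.

```python
import time


def fetch_with_backoff(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until the status is not 429, doubling the wait each time."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    return status, body


# Simulated server that rate-limits twice and then succeeds
responses = iter([(429, ""), (429, ""), (200, "<html>ok</html>")])
waits = []  # records the requested delays instead of actually sleeping
result = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

In a real scraper, `fetch` could wrap `requests.get` and return `(response.status_code, response.text)`.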