# Project Description

This project scrapes data from the UCSD Student Organization website, which lists over 500 student organizations. As part of my work at the UCSD Career Center, I was responsible for creating a database containing only the academic program (e.g., Undergraduate or Graduate), the purpose of the club, the official club email, and all club board emails for the specific clubs related to the areas our Associate Directors oversee. This data was to be organized into a Google Spreadsheet, but because of the website's formatting it was not easily transferable, and over 2000 emails had to be copied over manually.

In essence, although all of this data already exists on the website, this project extracts only the data our staff needs from a specified URL.

#### Note: 

While this project and demo are currently coded to work with a single base URL (which can be reassigned to the URL of any club page the user wants), I plan to continue this project so that it loops through all 500+ student organizations and stores the information in a new CSV file.

The demo below uses the ACM AI club page as its base URL and extracts all of the needed data.
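
The planned loop-and-CSV step described in this note could be sketched roughly as follows. This is a minimal sketch, not the project's implementation: `scrape_to_csv`, its parameters, and the column names are hypothetical, and it assumes each extractor is a function that takes a URL and returns either a string or a list of strings, as the four functions in the demo do.

```python
import csv
from typing import Callable, Dict, Iterable


def scrape_to_csv(
    urls: Iterable[str],
    extractors: Dict[str, Callable[[str], object]],
    out_path: str,
) -> int:
    """Run every extractor on every URL and write one CSV row per club.

    Hypothetical helper: `extractors` maps a column name to a function
    that accepts a club-page URL. Returns the number of rows written.
    """
    fieldnames = ["url", *extractors]
    rows_written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for url in urls:
            row = {"url": url}
            for column, extract in extractors.items():
                try:
                    value = extract(url)
                except Exception:
                    # Assumption: a page with a missing field should leave
                    # a blank cell rather than stop the whole run.
                    value = ""
                if isinstance(value, (list, tuple)):
                    # Lists (e.g. board emails) are joined into one cell.
                    value = "; ".join(value)
                elif isinstance(value, str):
                    # Trim the padding that scraped text often carries.
                    value = value.strip()
                row[column] = value
            writer.writerow(row)
            rows_written += 1
    return rows_written
```

With this project's functions, the mapping passed as `extractors` might be `{"academic_program": ext_ac_program, "purpose": ext_purpose, "club_email": ext_club_email, "board_emails": ext_board_emails}`, and `urls` would be the list of club-page URLs.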

#### Citations: 

Since this project focused on web scraping, parsing data, and working with HTML-related properties, I researched several websites to understand how to build my code. The following sources were referenced throughout this project.

1. To parse data: https://realpython.com/beautiful-soup-web-scraper-python/
2. To extract data from specific tags: https://stackoverflow.com/questions/32475700/using-beautifulsoup-to-extract-specific-dl-and-dd-list-elements
3. To explain (but not write) code: ChatGPT

## Project Code

In [7]:
from my_module.functions import ext_ac_program, ext_purpose, ext_club_email, ext_board_emails

# Club page to scrape; reassign to any other organization's page URL
base_url = 'https://studentorg.ucsd.edu/Home/Details/15823'

In [8]:
#demo extracting academic program from ACM AI club page
ext_ac_program(base_url)

'Undergraduate'

In [9]:
#demo extracting purpose from ACM AI club page
ext_purpose(base_url)

'                At ACM AI, our goals are to increase and promote interest in Artificial Intelligence at UCSD. We believe learning about AI should be accessible for everyone, and to this end, we want to lower the barrier for entry for folks interested in learning about AI. We aim to do this by hosting activities such as technical workshops, research panels, and AI competitions to both enhance the skills of our members and develop a connected community of scholars interested in AI. We hope that through member participation in our events, they will learn about the applications of AI in academia and industry, and moreover, we hope that everyone can learn about the hands-on, fun aspects of AI.            '
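
The purpose string above comes back padded with the whitespace that surrounds it in the page's HTML. One way to tidy such text before storing it is sketched below; `clean_text` is a hypothetical helper and is not part of `my_module`.

```python
def clean_text(raw: str) -> str:
    """Collapse runs of whitespace (spaces, tabs, newlines) and trim both ends."""
    # str.split() with no arguments splits on any whitespace run and
    # drops empty pieces, so joining with a single space normalizes it.
    return " ".join(raw.split())
```

Under that assumption, `clean_text(ext_purpose(base_url))` would return the same sentence without the leading and trailing spaces.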

In [10]:
#demo extracting club email from ACM AI club page
ext_club_email(base_url)

'ai@acmucsd.org'

In [11]:
#demo extracting list of board emails from ACM AI club page
ext_board_emails(base_url)

['stao@ucsd.edu', 'jzamoraanaya@ucsd.edu', 'jpiepkorn@ucsd.edu']

In [12]:
# test it out
!pytest

platform linux -- Python 3.9.5, pytest-7.2.2, pluggy-1.0.0
rootdir: /home/n3huynh/Final_Project_COGS18_SP23
plugins: anyio-3.2.1
collected 4 items

my_module/test_functions.py ....                                         [100%]


#### Extra Credit (*optional*)

1. Prior to this class, I had little to no coding background (I had only used MATLAB for simple matrix operations in my Linear Algebra class).


2. I challenged myself to do this project because it is something I need for my current job outside of class. It required a lot of research on my end to learn how to apply BeautifulSoup and pandas, read HTML tags and inspect a page, use nextSibling (which turned out not to be applicable to what I needed), parse data, and work with NavigableStrings, all of which was outside the scope of this class. I also learned how to extract all 500+ URLs from this site, but did not include that here, as I will likely continue working on this project to figure out how to access the content at these URLs using a loop and other functions.
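
The URL-collection step mentioned above was not included in this project, but one way it could look is sketched here using only the standard library. The names `DetailLinkParser` and `extract_detail_urls` are hypothetical, and the sketch assumes that links to club pages contain `/Home/Details/` in their `href`, as the demo's base URL does; the actual listing page may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DetailLinkParser(HTMLParser):
    """Collect unique links whose href points at a club detail page."""

    def __init__(self, base: str):
        super().__init__()
        self.base = base
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: detail pages follow the /Home/Details/<id> pattern.
        if href and "/Home/Details/" in href:
            full_url = urljoin(self.base, href)
            if full_url not in self.urls:  # keep first occurrence only
                self.urls.append(full_url)


def extract_detail_urls(html: str, base: str = "https://studentorg.ucsd.edu/"):
    """Return the detail-page URLs found in an HTML string, in page order."""
    parser = DetailLinkParser(base)
    parser.feed(html)
    return parser.urls
```

The HTML passed in would be the text of the organization listing page, fetched separately; each returned URL could then be handed to the extraction functions in turn.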