Smart_Sheet

In this project, a google service account will act as an enabler and connector in building a simple Extract, Transform, and Load Pipeline.

The project involves scraping and transforming the data from yahoo finance using Python, storing the data on Google Sheets, and then visualizing the data using Google Data Studio.

The Project is part of my "Building your first Google Cloud Analytics Project"

Link to the Yahoo Finance Website

Link to the Medium Article

Project Structure

Set up of Compute Engine Instance
Installation of Anaconda
Installation of Google Chrome and the Chrome Driver
Installation of D
Undestanding the Components of the Scripts
Installing Required Libraries
Running the Python script
Running the Cronjob

Set up of Compute Engine Instance

The project was built using a virtual machine, for set up; please refer to:

Installation of Anaconda

Go to the anaconda’s download page, look for the linux version, and then copy the link address. On the terminal of the instance, you can then use wget ‘the link address’ to download anaconda just as seen in the command below. The installation of anaconda would automatically lead to the installation of python 3 as well as jupyter notebook.

    wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh

The above command installs anaconda on the virtual machine; the ff command is then used to install it:

   bash Anaconda3-2022.10-Linux-x86_64.sh

You might need to restart the instance before. To test that anaconda was installed properly.

   conda --version

You can then launch jupyter notebook

   jupyter notebook

To close down the jupyter notebook, press Ctrl+D on windows, CMD+D on mac

Installation of Google Chrome and the Chrome Driver

Installation of Docker

Docker will help us package your application and all it dependencies into a container in which we can then deploy. To install docker use

   sudo apt-get install docker.io

Check the version you installed using docker —version

In other to ensure that docker runs without sudo;check the link

Now try; docker run hello-world In this case; hello-world is a docker image. In the case you don’t have the docker image locally, docker pulls the image from the docker registry and builds the container on your local machine.

Understand the Python Scripts

main.py: This python script is used to extract the data from Yahoo Finance. It uses selenium and beautiful soup which are bith web scraping tools to pull the data from the website and save it as a python dataframe

sheet_connect.py: This python script contains the functionality that sets up the connection from Python to Google Sheet.

send_mail.py: This python script was set up for logging and error management.It contains funtionalities that informs us when the pipeline breaks, and what caused the break.

push.py: This python script was used to push the first batch of data from yahoo finance to google sheet. It is mainly used to test if the connection between python and the google sheet actually works.

push_append.py: This is the python script that combines the functionality of all the other scripts. The other scripts are imported as modules, and their various functionalities are combined towards ensuring that the data is scraped, transformed, and transfered to google sheet.

Installing Required Libraries

The requirements.txt file contains all the libraries that are used for the project. It was generated after the completion of the projects using pipreqs.

   pip install -r requirements.txt

Running the Python script

From your working directory; use the command below

   python3 push_append.py

Setting up the CronJob

This is one of the major reasons, we are using a google cloud virtual image, we want to be able to run the pipeline every minute. To set the cronjob, open the crontab using the command below:

   crontab -e

Scroll down and then input this command as seen below:

   * * * * * export key_file=/home/macbook/.service_accounts_keys/name_of_service_account_key.json; /home/macbook/anaconda3/bin/python3 ~/Smart_Sheet/push_append.py

The command above instructs our machine to run the append.py using the python from the anaconda installation directory. An environmental variable is also passed as this is a dependency for the python script to run successfully.

When running cronjobs, it is very important to use the absolute path of all files and resources involved. There are some exceptions.

Name		Name	Last commit message	Last commit date
Latest commit History 54 Commits
airflow_setup		airflow_setup
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
cloudbuild.yaml		cloudbuild.yaml
linux_selenium_chrome_installer.sh		linux_selenium_chrome_installer.sh
main.py		main.py
push.py		push.py
push_append.py		push_append.py
requirements.txt		requirements.txt
send_mail.py		send_mail.py
sheet_connect.py		sheet_connect.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Smart_Sheet

Project Structure

Set up of Compute Engine Instance

Installation of Anaconda

Installation of Google Chrome and the Chrome Driver

Installation of Docker

Understand the Python Scripts

Installing Required Libraries

Running the Python script

Setting up the CronJob

About

Releases

Packages

Languages

paulonye/Smart_Sheet

Folders and files

Latest commit

History

Repository files navigation

Smart_Sheet

Project Structure

Set up of Compute Engine Instance

Installation of Anaconda

Installation of Google Chrome and the Chrome Driver

Installation of Docker

Understand the Python Scripts

Installing Required Libraries

Running the Python script

Setting up the CronJob

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages