# Data Pipelines project: Your first data pipeline

## What?
The main idea is that you'll use web scraping to collect data from the internet.
You'll put this data together to create a dataset which you can use for future projects.
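
As a rough illustration of what "collecting data from a page" can look like, here is a minimal sketch using only Python's standard library `html.parser`. The markup in `SAMPLE_HTML`, the `price` class name and the `PriceParser` and `extract_prices` names are all made up for this example; a real page will need its own targeting logic, and you may prefer a dedicated scraping library.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        # Only keep text that sits inside a price span
        if self._in_price:
            self.prices.append(data.strip())


# Hypothetical page content; in practice this would come from an HTTP response.
SAMPLE_HTML = """
<ul>
  <li><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li><span class="name">Gadget</span> <span class="price">24.50</span></li>
</ul>
"""


def extract_prices(html):
    """Parse the page and return the prices as floats."""
    parser = PriceParser()
    parser.feed(html)
    return [float(p) for p in parser.prices]


prices = extract_prices(SAMPLE_HTML)
```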

In your next class, you will be introduced to machine learning. You'll learn that data is the enabling component that large machine learning systems depend on. You'll also learn that supervised learning amounts to finding a mathematical function which maps from input to output, and that the dataset defines these inputs and outputs. For this project, the dataset you collect should be suitable for supervised learning: each example should pair input features with a label.

You should also use cron to schedule your Python web scraping script so that it runs automatically at intervals.
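
One possible shape for a cron-friendly script is sketched below. The file name `scraper.py`, the paths in the crontab line and the `run_scraper` placeholder are assumptions for illustration only. The crontab entry in the comment uses the standard five time fields (minute, hour, day of month, month, day of week) and would run the script at minute 0 of every hour.

```python
# Example crontab entry (add it with `crontab -e`); the paths are hypothetical:
# 0 * * * * /usr/bin/python3 /home/you/project/scraper.py >> /home/you/project/scraper.log 2>&1

import logging
import sys
from datetime import datetime, timezone


def run_scraper():
    """Placeholder for the real scraping logic; returns the number of rows collected."""
    return 0


def main():
    # Log to stdout so that the cron redirection above captures everything in one file
    logging.basicConfig(
        stream=sys.stdout,
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    logging.info("Scrape started at %s", datetime.now(timezone.utc).isoformat())
    try:
        rows = run_scraper()
    except Exception:
        # Record the full traceback in the log, then let the failure propagate
        logging.exception("Scrape failed")
        raise
    logging.info("Scrape finished, %d rows collected", rows)
    return rows


if __name__ == "__main__":
    main()
```

Because cron runs unattended, writing timestamps and errors to a log file is usually the simplest way to find out whether a scheduled run actually worked.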

## Why?
Creating a dataset relevant to work that you want to do in the future will give you something uniquely interesting to companies that you might apply to.

By creating your own dataset, you can start working on an interesting problem that might be relevant to awesome people that you can reach out to. Later in the course, we'll help introduce you to people who can give advice on the kind of problems you're working on.

What problems do you think you might be able to tackle using AI? What interests you? What kind of data would the jobs you want to apply for value experience with?

## Deliverables
- A GitHub repo containing all of the code
- Obviously, the dataset (probably not pushed to GitHub because it will be huge)
    - it must contain at least one numerical feature and one numerical or categorical label (so we can apply ML in the next unit)
    - at least 1000 examples
- A slideshow presentation explaining
    - the different locations on the internet that your script collects data from
    - the layout of the example webpages you scraped and how you targeted elements within them
    - in what format you chose to store your data and why (data lake vs data warehouse, file type, database table schema)
    - how you cleaned the data
    - suggestions of which variable in your dataset you will attempt to predict using some of the remaining variables
    - do not include screenshots of code
- a contribution to the datasets module of the ai_core Python library [added late]

## Deadline
The project deadline is 2 weeks from the date it is announced.

## Marking criteria

Each of the bullets under the following headings will be scored as "not attempted" (0 points), "attempted" (1 point) or "successfully applied" (2 points).

### Readability
- Your repository has a readme file summarising what the project does, the motivation behind it, how to use it and what you achieved
- You added comments explaining each decision and docstrings for each function/class
- Your repo contains all the elements necessary to run your code and does not contain deprecated files

### Programming
- code is object oriented, with logical chunks of code within functions and related functions encapsulated within classes
- code is stored in `.py` files
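
To illustrate the object-oriented structure described above, here is a minimal sketch. The `Example` fields, the `Scraper` class and its comma-separated input format are hypothetical stand-ins; a real scraper would parse HTML and fetch over the network.

```python
from dataclasses import dataclass, asdict


@dataclass
class Example:
    """One row of the dataset (hypothetical fields)."""
    example_id: str
    area_sqm: float  # numerical feature
    price: float     # label


class Scraper:
    """Groups the fetch, parse and collect steps for one data source."""

    def __init__(self, fetch):
        # `fetch` is any callable mapping a URL to page text,
        # which lets the class be tested without network access
        self.fetch = fetch
        self.examples = []

    def parse(self, text):
        """Turn 'id,area,price' text into an Example (stand-in for real HTML parsing)."""
        example_id, area, price = text.strip().split(",")
        return Example(example_id, float(area), float(price))

    def scrape(self, urls):
        """Fetch and parse every URL, then return all stored examples as dicts."""
        for url in urls:
            self.examples.append(self.parse(self.fetch(url)))
        return [asdict(e) for e in self.examples]


# Usage with a fake in-memory "website" instead of real HTTP requests
fake_pages = {"u1": "a1,50,100000", "u2": "a2,75.5,150000"}
scraper = Scraper(fake_pages.get)
rows = scraper.scrape(["u1", "u2"])
```

Passing the fetching function in as an argument is one design choice among several; it keeps parsing logic separate from network access.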

### Data storage
- Tabular data stored in a SQL database
- Raw data stored in a data lake (S3), if appropriate
- Images, if part of the dataset, are downloaded, stored in a data lake and named by example id, as in the tabular data.
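
The sketch below shows one way the tabular side could look, using Python's built-in `sqlite3` as a local stand-in for whichever SQL database you choose. The table name, columns and the `image_key` helper are assumptions for illustration; uploading to S3 is not shown, only how an image name can be derived from the example id so that it matches the table.

```python
import sqlite3
from pathlib import PurePosixPath


def image_key(example_id, prefix="images", ext="jpg"):
    """Build a storage key so the image name matches the example id in the table."""
    return str(PurePosixPath(prefix) / f"{example_id}.{ext}")


def save_examples(conn, rows):
    """Create the table if needed and insert rows of (example_id, area_sqm, price)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS examples (
            example_id TEXT PRIMARY KEY,
            area_sqm   REAL NOT NULL,
            price      REAL NOT NULL,
            image_key  TEXT
        )
        """
    )
    # INSERT OR IGNORE (SQLite syntax) skips rows whose example_id is already stored
    conn.executemany(
        "INSERT OR IGNORE INTO examples VALUES (?, ?, ?, ?)",
        [(i, a, p, image_key(i)) for i, a, p in rows],
    )
    conn.commit()


# In-memory database for demonstration; a real project would use a persistent one
conn = sqlite3.connect(":memory:")
save_examples(conn, [("a1", 50.0, 100000.0), ("a2", 75.5, 150000.0)])
save_examples(conn, [("a1", 50.0, 100000.0)])  # re-running does not duplicate a1
```

Using the example id as the primary key means a scheduled scraper can be re-run without inserting the same example twice.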

### Data cleaning
- Your structured data contains a single variable per column
- You handled duplicates and nulls properly
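
As a small illustration of these two criteria, the sketch below cleans a list of scraped rows using plain Python. The field names and the `clean` function are hypothetical, and dropping rows with a missing label is only one way to handle nulls; for missing features, filling in a default or an estimate may be more appropriate.

```python
def clean(rows):
    """Return rows with one variable per column, no duplicates, and no missing labels."""
    seen = set()
    cleaned = []
    for row in rows:
        # Drop rows whose label is missing: they cannot be used for supervised learning
        if row.get("price") in (None, ""):
            continue
        # Drop repeated examples, keeping the first occurrence
        if row["example_id"] in seen:
            continue
        seen.add(row["example_id"])
        # Split a combined "city, country" field into two separate columns
        city, _, country = row["location"].partition(",")
        cleaned.append({
            "example_id": row["example_id"],
            "city": city.strip(),
            "country": country.strip(),
            "price": float(row["price"]),
        })
    return cleaned


raw = [
    {"example_id": "a1", "location": "London, UK", "price": "100000"},
    {"example_id": "a1", "location": "London, UK", "price": "100000"},  # duplicate
    {"example_id": "a2", "location": "Paris, France", "price": None},   # missing label
    {"example_id": "a3", "location": "Leeds, UK", "price": "85000"},
]
cleaned_rows = clean(raw)
```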

### Presentation
- presentation is part of the repository
- presentation lasted between 5 and 11 minutes (please, rehearse)

## Gold stars
- Special prize for the individual who collects the largest dataset.
- Can you figure out how to run the scraper on a remote server using AWS (or another cloud provider) so that it doesn't run on your local machine? Don't worry about cost: the major cloud providers offer free tiers or free credits for new accounts.
- Your data contains way more than 100K rows
- Your data contains tabular data, free text & images
- You used 2 or more sources of data in your project
- You used multithreading to accelerate the scraping
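
For the multithreading gold star, one common approach is the standard library's `concurrent.futures.ThreadPoolExecutor`. The sketch below is only an illustration: `fetch` is a stand-in that sleeps instead of making a real request, and the URLs and `max_workers` value are arbitrary. Threads help here because scraping spends most of its time waiting on the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stand-in for a network request: sleeps briefly, then returns fake page text."""
    time.sleep(0.05)
    return f"content of {url}"


def fetch_all(urls, max_workers=8):
    """Fetch URLs concurrently; results come back in the same order as `urls`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


urls = [f"https://example.com/page/{i}" for i in range(16)]
pages = fetch_all(urls)
```

When scraping real sites, keep the number of workers modest so that you do not overload the server you are collecting from.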