# Collecting Data
Machine learning models learn from the data you give them, so collecting reliable data is essential for a model to find the correct patterns. The quality of the data you feed the model determines how accurate it can be: incorrect or outdated data leads to wrong or irrelevant predictions.

Use data from a reliable source, since it directly affects your model's results. Good data is relevant, contains few missing or duplicated values, and represents each of the subcategories or classes present well.
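
The three properties of good data mentioned above (few missing values, few repeated rows, balanced classes) can be checked quickly before training. The following is a minimal sketch using pandas; the helper name `summarize_quality`, the column names, and the toy rows are all hypothetical and only illustrate the idea.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame, label_column: str) -> dict:
    """Return simple data-quality indicators for a labeled dataset."""
    return {
        # Fraction of missing values in each column
        "missing_fraction": df.isna().mean().to_dict(),
        # Number of rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Share of each class, to spot under-represented categories
        "class_share": df[label_column].value_counts(normalize=True).to_dict(),
    }


# Hypothetical toy data: one missing value and one repeated row
sample = pd.DataFrame(
    {
        "height_cm": [170.0, 182.0, None, 170.0, 165.0],
        "label": ["a", "b", "a", "a", "b"],
    }
)

report = summarize_quality(sample, "label")
print(report)
```

What counts as "too many" missing values or "too imbalanced" depends on the problem, so treat these numbers as a starting point for inspection rather than a pass/fail test.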

## The most common data sources for an ML model:
__1.__ Open Source Datasets

__2.__ Web Scraping

__3.__ Synthetic Datasets

__4.__ Manual Data Generation

## 1. Open Source Datasets
In simple terms, open data is data that anyone can access, modify, reuse, and share.
It grew out of various “open movements” such as open source, open hardware, open government, and open science.
Governments, independent organizations, and agencies have released large amounts of data for free and easy access.

**Some open data sources:**
 - Google Dataset Search
 - Kaggle
 - Data.gov
 - Datahub.io
 - UCI Machine Learning Repository
 - Earth Data
 - CERN Open Data Portal
 - Global Health Observatory Data Repository
 - BFI film industry statistics
 - NYC Taxi Trip Data
 - FBI Crime Data Explorer
 - World Bank Open Data
 - WHO (World Health Organization) open data repository
 - Google Public Data Explorer
 - Registry of Open Data on AWS (RODA)
 - European Union Open Data Portal
 - FiveThirtyEight
 - U.S. Census Bureau
 - DBpedia
 - freeCodeCamp Open Data
 - Yelp Open Datasets
 - UNICEF Dataset
 - LODUM
 - Dataverse
 - Open Data Kit
 - CKAN
 - Open Data Monitor
 - Plenar.io
 - Open Data Impact Map

## 2. Web Scraping to Extract Tabular Data

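```python
import pandas as pd

# read_html returns a list with one DataFrame per <table> found on the page
# (it needs an HTML parser such as lxml installed).
tables = pd.read_html("https://en.wikipedia.org/wiki/Premier_League_records_and_statistics")
tables[6]  # position of the league table when this was run; it may shift if the page changes
```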

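| | Pos. | Club | Seasons | Pld | W | D | L | GF | GA | GD | ... | 4th | 5th | 6th | 7th | T4 | T7 | Debut | Since/Last App. | Relegated | Best Pos. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Manchester United | 30 | 1152 | 703 | 257 | 192 | 2185 | 1066 | 1119 | ... | 1.0 | 1.0 | 3.0 | 1.0 | 25.0 | 30.0 | 1992–93 | 1992–93[a] | | 1 |
| 1 | 2 | Arsenal | 30 | 1152 | 619 | 284 | 249 | 2017 | 1148 | 869 | ... | 7.0 | 4.0 | 1.0 | | 21.0 | 26.0 | 1992–93 | 1992–93[b] | | 1 |
| 2 | 3 | Chelsea | 30 | 1152 | 618 | 284 | 250 | 1973 | 1125 | 848 | ... | 4.0 | 2.0 | 4.0 | | 19.0 | 25.0 | 1992–93 | 1992–93[c] | | 1 |
| 3 | 4 | Liverpool | 30 | 1152 | 609 | 282 | 261 | 2021 | 1147 | 874 | ... | 7.0 | 2.0 | 3.0 | 3.0 | 19.0 | 27.0 | 1992–93 | 1992–93[d] | | 1 |
| 4 | 5 | Tottenham Hotspur | 30 | 1152 | 502 | 281 | 369 | 1745 | 1438 | 307 | ... | 4.0 | 5.0 | 2.0 | 2.0 | 7.0 | 16.0 | 1992–93 | 1992–93[e] | | 2 |
| 5 | 6 | Manchester City | 25 | 962 | 473 | 210 | 279 | 1658 | 1068 | 590 | ... | 1.0 | 1.0 | | | 12.0 | 13.0 | 1992–93 | 2002–03 | 2.0 | 1 |
| 6 | 7 | Everton | 30 | 1152 | 418 | 320 | 414 | 1491 | 1481 | 10 | ... | 1.0 | 3.0 | 3.0 | 4.0 | 1.0 | 11.0 | 1992–93 | 1992–93[f] | | 4 |
| 7 | 8 | Newcastle United | 27 | 1034 | 382 | 264 | 388 | 1377 | 1417 | −40 | ... | 1.0 | 2.0 | 1.0 | 1.0 | 5.0 | 9.0 | 1993–94 | 2017–18 | 2.0 | 2 |
| 8 | 9 | Aston Villa | 27 | 1038 | 354 | 296 | 388 | 1265 | 1353 | −88 | ... | 1.0 | 1.0 | 6.0 | 1.0 | 2.0 | 10.0 | 1992–93 | 2019–20 | 1.0 | 2 |
| 9 | 10 | West Ham United | 26 | 996 | 335 | 253 | 408 | 1235 | 1429 | −194 | ... | | 1.0 | 1.0 | 3.0 | | 5.0 | 1993–94 | 2012–13 | 2.0 | 5 |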


```python
import pandas as pd

# Any URL whose page contains an HTML <table> can be read the same way.
tables = pd.read_html("https://www.geeksforgeeks.org/difference-between-symmetric-and-asymmetric-key-encryption/")
tables[0]  # first table on the page
```

| | Symmetric Key Encryption | Asymmetric Key Encryption |
|---|---|---|
| 0 | It only requires a single key for both encrypt... | It requires two keys, a public key and a priva... |
| 1 | The size of cipher text is the same or smaller... | The size of cipher text is the same or larger ... |
| 2 | The encryption process is very fast. | The encryption process is slow. |
| 3 | It is used when a large amount of data is requ... | It is used to transfer small amounts of data. |
| 4 | It only provides confidentiality. | It provides confidentiality, authenticity, and... |
| 5 | The length of key used is 128 or 256 bits | The length of key used is 2048 or higher |
| 6 | In symmetric key encryption, resource utilizat... | In asymmetric key encryption, resource utiliza... |
| 7 | It is efficient as it is used for handling lar... | It is comparatively less efficient as it can h... |
| 8 | Security is less as only one key is used for b... | It is more secure as two keys are used here- o... |
| 9 | The Mathematical Representation is as follows-... | The Mathematical Representation is as follows-... |
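
Both examples above pick a table by its position in the list that `pd.read_html` returns, which breaks silently if the page gains or loses a table. As a hedged alternative, the `match` argument keeps only tables whose text matches a string or regular expression. The sketch below runs offline on a small, made-up HTML snippet (the club names and numbers are placeholders, not real statistics) and assumes an HTML parser such as `lxml` is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page content with two tables; only the second one is wanted.
html = """
<html><body>
<table>
  <tr><th>Key</th><th>Value</th></tr>
  <tr><td>pages</td><td>2</td></tr>
</table>
<table>
  <tr><th>Pos.</th><th>Club</th><th>Pld</th></tr>
  <tr><td>1</td><td>Club A</td><td>38</td></tr>
  <tr><td>2</td><td>Club B</td><td>38</td></tr>
</table>
</body></html>
"""

# Without a filter, every <table> becomes one DataFrame in the list.
all_tables = pd.read_html(StringIO(html))

# With match=..., only tables containing the given text are returned.
club_tables = pd.read_html(StringIO(html), match="Club")
standings = club_tables[0]

print(len(all_tables), len(club_tables))
print(standings)
```

The same `match` argument works when a URL is passed instead of a `StringIO` object, as long as the site allows the request.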


## 3. Synthetic Datasets
Synthetic data is information that's artificially manufactured rather than generated by real-world events. It's created algorithmically and is used as a stand-in for test data sets of production or operational data, to validate mathematical models and to train machine learning (ML) models.

**How is synthetic data generated?**

The process of generating synthetic data differs by the tools and algorithms used and the specific use case.
The following are three common techniques used for creating synthetic data:

1. **Drawing numbers from a distribution.** Randomly selecting numbers from a distribution is a common method for creating synthetic data. Although this method doesn't capture the insights of real-world data, it can produce a data distribution that closely resembles real-world data.
2. **Agent-based modeling.** This simulation technique involves creating unique agents that interact with one another. It is especially helpful for examining how different agents, such as mobile phones, people, or even computer programs, behave together in a complex system. Python packages such as Mesa provide pre-built core components that make it easier to develop agent-based models quickly and view them via a browser-based interface.
3. **Generative models.** These algorithms can generate synthetic data that replicates the statistical properties or features of real-world data. Generative models use a set of training data to learn the statistical patterns and relationships in the data and then use this knowledge to generate new synthetic data that's similar to the original data. Examples of generative models include generative adversarial networks and variational autoencoders.
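
The three techniques above can be sketched in a few lines each. This is an illustrative toy, not a production recipe: every distribution parameter is an arbitrary assumption, the agent-based part is written in plain Python rather than with Mesa, and the "generative model" is deliberately the simplest possible one (a multivariate Gaussian fitted to the data) standing in for heavier models such as GANs or variational autoencoders.

```python
import random

import numpy as np

rng = np.random.default_rng(seed=0)

# 1. Drawing numbers from a distribution.
# Hypothetical "age" and "income" columns with assumed parameters.
n_rows = 1_000
ages = rng.normal(loc=35.0, scale=10.0, size=n_rows).clip(18, 90)
incomes = rng.lognormal(mean=10.5, sigma=0.4, size=n_rows)


# 2. Agent-based modeling.
# Each agent starts with one unit of wealth; at every step a random agent
# gives one unit to another random agent, if it has any left.
def simulate_exchange(n_agents: int, n_steps: int, seed: int = 0) -> list:
    picker = random.Random(seed)
    wealth = [1] * n_agents
    for _ in range(n_steps):
        giver = picker.randrange(n_agents)
        receiver = picker.randrange(n_agents)
        if wealth[giver] > 0:
            wealth[giver] -= 1
            wealth[receiver] += 1
    return wealth


wealth = simulate_exchange(n_agents=50, n_steps=5_000)

# 3. Generative model (simplest case).
# "observed" is a stand-in for real data; in practice it would be loaded
# from a real dataset instead of being simulated here.
observed = rng.multivariate_normal(
    mean=[0.0, 5.0], cov=[[1.0, 0.8], [0.8, 2.0]], size=2_000
)
# Learn the statistical properties (mean vector and covariance matrix) ...
fitted_mean = observed.mean(axis=0)
fitted_cov = np.cov(observed, rowvar=False)
# ... then sample brand-new rows that follow the same statistics.
synthetic = rng.multivariate_normal(fitted_mean, fitted_cov, size=2_000)

print(ages[:3], incomes[:3])
print(sorted(wealth, reverse=True)[:5])
print(fitted_mean, synthetic.mean(axis=0))
```

In the agent-based part, total wealth never changes, but it ends up unevenly distributed: a pattern that emerges from the interactions rather than from any distribution chosen up front. In the generative part, the synthetic rows reproduce the mean and correlation of the observed rows without copying any of them.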


**What are examples of synthetic data?**

Synthetic data is used across many different industries for various use cases. The following are some examples of synthetic data applications:

1. **Media data.** In this use case, computer graphics and image processing algorithms are used to generate synthetic images, audio and video. For example, Amazon uses synthetic data to train Amazon Alexa's language system.
2. **Text data.** This can include chatbots, machine translation algorithms and sentiment analysis based on artificially generated text data. ChatGPT is an example of a tool that uses text data.
3. **Tabular data.** This consists of synthetically generated data tables used for data analysis, model training and other applications.
4. **Unstructured data.** Unstructured data can include images, video and audio data that are mostly employed in fields such as computer vision, speech recognition and autonomous vehicle technology. For example, Google's Waymo uses synthetic data to train its self-driving cars.
5. **Financial services data.** The financial sector relies heavily on synthetic data, especially for fraud detection, risk management and credit risk assessments. For example, JPMorgan and American Express use synthetic financial data to improve fraud detection.
6. **Manufacturing data.** The manufacturing industry uses synthetic data for quality control testing and predictive maintenance. For instance, German insurance company Provinzial tests synthetic data for predictive analytics.