# Generating Fake CSV Data with Python using Faker

We will be emulating one of the free datasets from Kaggle, specifically the Netflix original films IMDB scores dataset, to generate something similar.

## Prerequisites:

- Familiarity with `Pipenv`. See the [Pipenv documentation](https://docs.pipenv.org/basics/).
- Familiarity with `JupyterLab`. See the [JupyterLab documentation](https://jupyterlab.readthedocs.io/en/stable/).

## Getting Started

1. Install `Pipenv` if you don't have it installed already. You can install it using `pip`:

```bash
pip install pipenv
```

2. Create a new directory and navigate into it:

```bash
mkdir fake-csv-data
cd fake-csv-data
```

3. Create a new `Pipenv` environment:

```bash
pipenv --python 3.10
```

4. Install the required packages:

```bash
pipenv install pandas faker jupyterlab
```

5. Start JupyterLab:

```bash
pipenv run jupyter lab
```

6. The server will now be up and running. Navigate to the URL provided in the terminal to access JupyterLab.

Example:

```bash
http://localhost:8888/lab/workspaces/auto-I
```

7. Create a new notebook:

Once JupyterLab is open at http://localhost:8888/lab, create a new Python 3 notebook from the launcher.

> Note: Ensure that this notebook is saved in generating-fake-csv-data-with-python/docs/generating-fake-data.ipynb.

We will split this mini project into two parts:

1. Importing Faker and generating data.
2. Importing the CSV module and exporting the data to a CSV file.

**Before generating our data, we need to look at what we are trying to emulate.**

### Emulating The Netflix Original Movies IMDB Scores Dataset

Looking at the preview of the dataset, we can see that it contains the following columns and example rows:

| Title | Genre | Premiere | Runtime | IMDB Score | Language |
| ------| ----- | -------- | ------- | ---------- | -------- |
| Enter the Anime | Documentary | August 5, 2019 | 58 | 2.5 | English/Japanese |
| Dark Forces | Thriller | August 21, 2020 | 81 | 2.6 | Spanish |

We only have two example rows, but they are enough to make a few assumptions about how we want to emulate the dataset.

1. For languages, we will stick to a single language per movie (unlike the example English/Japanese).
2. We will assume IMDB scores are between 1 and 5. We won't be too harsh on any movie, so scores start at 1 rather than 0.
3. Runtimes should emulate a real movie, so we can set them to be between 50 and 150 minutes.
4. Genres may be something we need to write our own Faker provider for.
5. We are going to be okay with nonsense data, so we can just use a word generator for the titles.
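The assumptions above can be written down as a few constants plus a small check. This is only a sketch: the names `LANGUAGES`, `GENRES`, `RUNTIME_RANGE`, `SCORE_RANGE`, and `row_matches_assumptions` are hypothetical, and the genre and language lists are the ones used later in this tutorial.

```python
# Hypothetical constants that restate the assumptions listed above
LANGUAGES = ['English', 'Chinese', 'Italian', 'Spanish', 'Hindi', 'Japanese']
GENRES = ['Documentary', 'Thriller', 'Mystery', 'Horror', 'Action', 'Comedy', 'Drama', 'Romance']
RUNTIME_RANGE = (50, 150)   # minutes, both ends allowed
SCORE_RANGE = (1.0, 5.0)    # assumed score range for this tutorial

def row_matches_assumptions(row):
    """Check a row shaped like [title, genre, premiere, runtime, score, language]."""
    title, genre, premiere, runtime, score, language = row
    return (
        genre in GENRES
        and language in LANGUAGES  # a combined value like English/Japanese fails here
        and RUNTIME_RANGE[0] <= runtime <= RUNTIME_RANGE[1]
        and SCORE_RANGE[0] <= score <= SCORE_RANGE[1]
    )

print(row_matches_assumptions(['Dark Forces', 'Thriller', 'August 21, 2020', 81, 2.6, 'Spanish']))  # True
```

A check like this can be handy later for spot-checking generated rows before they are written to disk.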

With this said, let's look at how we can fake this.

## Emulating a value for each column

We will create seven cells: one to import Faker and one for each of the six columns.

1. For the first cell, we will import Faker.

```python
from faker import Faker

fake = Faker()
```

2. We will fake a movie name with words:

```python
def capitalize(word):
    # The parameter is not named `str`, which would shadow the built-in type
    return word.capitalize()

words = fake.words()  # three random words by default
capitalized_words = list(map(capitalize, words))
movie_name = ' '.join(capitalized_words)
print(movie_name)  # e.g. Real Hair Month
```


3. We will generate a date this decade and use the same format as the example:

```python
from datetime import datetime

# %d zero-pads the day, so August 5 would be printed as "August 05"
date = datetime.strftime(fake.date_time_this_decade(), "%B %d, %Y")
print(date)  # e.g. April 27, 2024
```
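The example rows show the day without a leading zero (`August 5, 2019`), whereas `%d` zero-pads single-digit days. If matching that detail matters, one portable option is to build the string from the date's attributes. This is a minimal sketch; `format_premiere` is a hypothetical helper name, not part of Faker or the original notebook.

```python
from datetime import datetime

def format_premiere(moment: datetime) -> str:
    """Format a datetime like 'August 5, 2019' (day without zero padding)."""
    # %B gives the full month name; day and year come from the integer attributes
    return f"{moment:%B} {moment.day}, {moment.year}"

print(format_premiere(datetime(2019, 8, 5)))  # August 5, 2019
```

The same helper could wrap `fake.date_time_this_decade()` in place of the `strftime` call.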


4. We will create our own fake data generator for the genre:

```python
# Creating a provider for the genre
from faker.providers import BaseProvider
import random

# Create a new provider class
class GenreProvider(BaseProvider):
    def movie_genre(self):
        return random.choice(['Documentary', 'Thriller', 'Mystery', 'Horror', 'Action', 'Comedy', 'Drama', 'Romance'])

# Then add the new provider to the Faker instance
fake.add_provider(GenreProvider)

# Now you can use:
movie_genre = fake.movie_genre()
print(movie_genre)  # e.g. Horror
```


5. We will do the same for a language:

```python
# Creating a provider for the language
# (BaseProvider and random were imported in the previous cell)

# Create a new provider class
class LanguageProvider(BaseProvider):
    def language(self):
        return random.choice(['English', 'Chinese', 'Italian', 'Spanish', 'Hindi', 'Japanese'])

# Then add the new provider to the Faker instance
fake.add_provider(LanguageProvider)

# Now you can use:
language = fake.language()
print(language)  # e.g. Hindi
```


6. We need to generate a runtime (in minutes) between 50 and 150:

```python
# Getting a random movie length; randint includes both 50 and 150
movie_len = random.randint(50, 150)
print(movie_len)  # e.g. 133
```
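One detail worth checking for the runtime range is whether the upper bound is included. `random.randrange(a, b)` never returns `b`, while `random.randint(a, b)` can. The short comparison below uses a seeded `random.Random` instance (an assumption made here only so the run is repeatable).

```python
import random

rng = random.Random(0)  # fixed seed so the comparison is repeatable

# Draw many values with each function and inspect the largest one seen
exclusive = [rng.randrange(50, 150) for _ in range(10_000)]
inclusive = [rng.randint(50, 150) for _ in range(10_000)]

print(max(exclusive) <= 149)  # True: randrange never returns the stop value
print(max(inclusive) <= 150)  # True: randint may return 150 itself
```

For a runtime "between 50 and 150" with both ends allowed, `randint(50, 150)` or `randrange(50, 151)` are equivalent choices.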


7. Lastly, we need a rating with one decimal place between 1.0 and 5.0:

```python
# Movie rating, rounded to one decimal place
random_rating = round(random.uniform(1.0, 5.0), 1)
print(random_rating)  # e.g. 4.0
```


## Generating the CSV

Now that we have all our information together, it is time to generate a CSV with 1000 entries.

We can place everything we know into one final cell to generate the data:

```python
import csv
import random
from datetime import datetime

from faker import Faker
from faker.providers import BaseProvider

class GenreProvider(BaseProvider):
    def movie_genre(self):
        return random.choice(['Documentary', 'Thriller', 'Mystery', 'Horror', 'Action', 'Comedy', 'Drama', 'Romance'])

class LanguageProvider(BaseProvider):
    def language(self):
        return random.choice(['English', 'Chinese', 'Italian', 'Spanish', 'Hindi', 'Japanese'])

fake = Faker()

fake.add_provider(GenreProvider)
fake.add_provider(LanguageProvider)

# Some of this is a bit verbose now, but doing so for the sake of completeness

def get_movie_name():
    # Capitalize each random word and join them into a title
    return ' '.join(word.capitalize() for word in fake.words())

def get_movie_date():
    return datetime.strftime(fake.date_time_this_decade(), "%B %d, %Y")

def get_movie_len():
    # randint includes both endpoints, so 150 is a possible runtime
    return random.randint(50, 150)

def get_movie_rating():
    return round(random.uniform(1.0, 5.0), 1)

def generate_movie():
    return [get_movie_name(), fake.movie_genre(), get_movie_date(), get_movie_len(), get_movie_rating(), fake.language()]

# newline='' stops the csv module from writing blank lines between rows on Windows
with open('movie_data.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Title', 'Genre', 'Premiere', 'Runtime', 'IMDB Score', 'Language'])
    # range(1000) writes exactly 1000 data rows
    for _ in range(1000):
        writer.writerow(generate_movie())
```
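A quick way to sanity-check the export is to read the rows back with `csv.DictReader`. The sketch below writes two hand-written sample rows to an in-memory buffer rather than to `movie_data.csv`, purely so it runs on its own; `HEADER`, `sample_rows`, and `buffer` are hypothetical names.

```python
import csv
import io

HEADER = ['Title', 'Genre', 'Premiere', 'Runtime', 'IMDB Score', 'Language']

# Two hand-written rows stand in for generated movies
sample_rows = [
    ['Real Hair Month', 'Horror', 'April 27, 2024', 133, 4.0, 'Hindi'],
    ['Dark Forces', 'Thriller', 'August 21, 2020', 81, 2.6, 'Spanish'],
]

# An in-memory text buffer behaves like a file opened with newline=''
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(HEADER)
writer.writerows(sample_rows)

# Rewind and read the rows back as dictionaries keyed by the header
buffer.seek(0)
records = list(csv.DictReader(buffer))

print(len(records))           # 2
print(records[0]['Runtime'])  # 133 (read back as the string '133')
```

Note that the `csv` module reads every field back as a string, so numeric columns such as `Runtime` and `IMDB Score` need converting before any analysis.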

## Summary

Today's tutorial demonstrated how to use the `Faker` package to generate fake data and the `csv` module to export that data to a file.

In the future, we can use this approach to build our own datasets and do some data science with them.

When you are not generating your own data, `Kaggle` and `Open Data` are great sources of datasets for analysis and visualization.


Thanks for reading! Happy coding! 🚀

Follow me on [GitHub - julioaranajr](https://github.com/julioaranajr) for more updates and tutorials like this.
