# Chapter 3 of Data Engineering with Python: practice code

### One of the most basic tasks in data engineering is moving data from a text file to a database. We will read data from and write data to several different text-based formats, such as CSV and JSON. 

## Writing and reading files in Python

###  To write data, you will use a library named <font color= blue> faker </font>. faker lets you create fake data for common fields: you can generate an address by calling <font color= green > address() </font>, or a female name by calling name_female(). This simplifies the creation of test data while also making it more realistic.



## Writing CSVs using the Python csv library

> ##### File modes for open():
>> 'w' --> write (creates the file, or erases its contents if it already exists)
>> 'a' --> append
>> 'r' --> read
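
A minimal sketch of how the three modes behave. It uses a throwaway file in a temporary directory; the file name `modes_demo.txt` is made up for this illustration and is not part of the chapter.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

with open(path, 'w') as f:   # 'w' creates the file, or truncates it if it exists
    f.write('first\n')

with open(path, 'a') as f:   # 'a' keeps the existing content and adds to the end
    f.write('second\n')

with open(path, 'r') as f:   # 'r' opens the file for reading only
    after_append = f.read()

with open(path, 'w') as f:   # reopening with 'w' discards the earlier lines
    f.write('third\n')

with open(path, 'r') as f:
    after_rewrite = f.read()

print(repr(after_append))
print(repr(after_rewrite))
```

The second read contains only the last line written, which is why 'w' should be used with care on files that already hold data.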

In [14]:
# open() with mode 'w' creates the file if it doesn't exist.
# If the file already exists, it is opened for writing and its contents are erased.
# newline='' lets the csv module control line endings itself.
output = open('myCSV.CSV', mode='w', newline='')


# Note - mode "w" deletes any data already in the file before writing.

In [15]:
# Create a CSV_writer

import csv

mywriter = csv.writer(output)

In [16]:
# Write the header row to the file you just opened

header = ['name', 'age']
mywriter.writerow(header)

10

In [17]:
# Write a data row to the file using a variable named "data0"

data0 = ['Bob Smith', 40]
mywriter.writerow(data0)



# I added the three rows below just to see more records in the file.
data1 = ['Larry Smith', 35]
mywriter.writerow(data1)

data2 = ['Joe Peter', 26]
mywriter.writerow(data2)

data3 = ['Simon Sam', 46]
mywriter.writerow(data3)

output.close()
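
The same kind of file can also be written with a `with` block, which closes the file automatically. The sketch below is one possible variant and not the book's code; the path `people.csv` in a temporary directory is made up for the illustration.

```python
import csv
import os
import tempfile

# Hypothetical output path, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'people.csv')

rows = [['Bob Smith', 40], ['Larry Smith', 35], ['Joe Peter', 26]]

# newline='' lets the csv module control line endings itself
with open(path, 'w', newline='') as output:
    writer = csv.writer(output)
    writer.writerow(['name', 'age'])   # header row
    writer.writerows(rows)             # all records in one call

# The file is already closed here; read it back to check the contents
with open(path, newline='') as f:
    written = list(csv.reader(f))

print(written)
```

Note that the ages come back as strings: the csv module does not convert types when reading.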

In [18]:
#This step is not in the book. In the book, "cat mycsv.csv" command is executed. 
#I added this so that I can see the result here in notebook. 

import pandas as pd

In [19]:
df = pd.read_csv('myCSV.CSV')

In [None]:
df

In [20]:
from faker import Faker
import csv

output=open('data.CSV','w', newline='')  # newline='' avoids blank rows on Windows
fake=Faker()
header=['name','age','street','city','state','zip','lng','lat']
mywriter=csv.writer(output)
mywriter.writerow(header)
for r in range(1000):
    mywriter.writerow([fake.name(),fake.random_int(min=18, max=80, step=1), 
                       fake.street_address(), fake.city(),fake.state(),
                       fake.zipcode(),fake.longitude(),fake.latitude()])

output.close()

In [21]:
# This step is not in the book; it loads the generated file so the result is visible in the notebook.

df = pd.read_csv('data.CSV')

In [None]:
df









## Reading CSVs

Reading a CSV file is similar to writing one: you open the file, create a reader instead of a writer, and iterate over the rows.


The with statement automatically takes care of closing the file once it leaves the with block, even in cases of error. I highly recommend that you use the with statement as much as possible, as it allows for cleaner code and makes handling any unexpected errors easier for you.

Most likely, you’ll also want to use the second positional argument, mode. This argument is a string that contains multiple characters to represent how you want to open the file. The default and most common is 'r', which represents opening the file in read-only mode as a text file:

    Example
> with open('dog_breeds.txt', 'r') as reader:
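
A short sketch of both points: the default mode and the automatic close. It writes its own scratch file first (the name `breeds_demo.txt` is made up) so that it does not depend on `dog_breeds.txt` existing.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'breeds_demo.txt')
with open(path, 'w') as writer:
    writer.write('Pug\nBeagle\n')

# The mode defaults to 'r', so open(path) and open(path, 'r') are equivalent
with open(path) as reader:
    default_mode = reader.mode
    lines = reader.read().splitlines()

# The file is closed even when the block raises an error
try:
    with open(path, 'r') as reader:
        raise ValueError('simulated failure while processing the file')
except ValueError:
    pass

print(default_mode, lines, reader.closed)
```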

In [None]:


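with open('data.CSV', newline='') as f:
    myreader = csv.DictReader(f)

    # DictReader reads the header row itself and exposes it as fieldnames;
    # calling next(myreader) here would consume the first data row instead
    headers = myreader.fieldnames

    for row in myreader:
        print(row['name'])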
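
A minimal sketch of how `csv.DictReader` treats the header row. It reads from an in-memory string with made-up records instead of `data.CSV`, so it can be run on its own.

```python
import csv
import io

# In-memory stand-in for a small CSV file (made-up records)
sample = io.StringIO('name,age\nBob Smith,40\nLarry Smith,35\n')

reader = csv.DictReader(sample)

# The header row is consumed by DictReader and exposed as fieldnames
columns = reader.fieldnames

# next() therefore returns the first data row, not the header
first = next(reader)

# Iterating afterwards continues from the second record
remaining = [row['name'] for row in reader]

print(columns, first['name'], remaining)
```

In other words, calling `next()` on a `DictReader` before looping skips a record, so it is only appropriate when that is the intent.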

In [None]:
## We could also do it this way. 




In [None]:
import csv


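f = open('data.CSV', newline='')

myreader = csv.DictReader(f)

# DictReader consumes the header row itself; fieldnames returns it
# without skipping a data row (next(myreader) would return the first record)
headers = myreader.fieldnames

In [None]:
print(headers)

In [None]:
for row in myreader:
    print(row['age'])

# Without a with block, the file has to be closed explicitly
f.close()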

# Reading and writing CSVs using pandas DataFrames

pandas DataFrames are a powerful tool not only for reading and writing data but also for
querying and manipulating it. pandas carries more overhead than the built-in
csv module, but there are times when the trade-off is worth it. You may already have
pandas installed, depending on your Python environment, but if you do not, you can
install it with the following:

>pip3 install pandas

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('data.CSV')

In [None]:
# Show the first 10 rows of the DataFrame
df.head(10)


# With no argument, head() shows only the first 5 rows.
# df.head()

In [None]:
#To read the bottom 10 rows of the file
df.tail(10)

In [None]:

#This command shows how many rows and columns
print(df.shape)

In [None]:
#This command shows what the column names are
print(df.columns)

In [None]:
#To get the data type of each column
print(df.dtypes)

In [None]:

#To get more information about the data
#info() prints its summary itself and returns None, so it does not need print()
df.info()

In [None]:
#To get a specific column
name_col = df['name']

In [None]:
#.head() gives the first 5 rows
print(name_col.head())

In [None]:
#To get the last 5 rows
print(name_col.tail())

In [None]:
#To get more than one column
subset = df[['name', 'age', 'street']]

In [None]:
#To get the first 5 rows of these columns
print(subset.head())

In [None]:
#To get the last 5 rows of these columns
print(subset.tail())

In [None]:
#To get a row by its index label
#With the default index, label 10 is the 11th row of the data frame
df.loc[10]

In [None]:
#To get the 1st, 100th, and 1000th rows from the 1st, 4th and 6th columns
print(df.iloc[[0,99,999], [0,3,5]])

In [None]:
#To get the last row
print(df.tail(n=1))

In [None]:
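#What is the difference between "loc" and "iloc"?
#loc selects by index label, iloc selects by integer position.
#So df.iloc[-1] is the last row, while df.loc[-1] looks for a row labelled -1.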
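
A small sketch of the label versus position distinction. The frame `df_demo` and its index labels are made up for the illustration, chosen so that labels and positions differ.

```python
import pandas as pd

# Made-up frame whose index labels differ from the row positions
df_demo = pd.DataFrame({'name': ['Paul', 'Bob', 'Susan'], 'age': [23, 45, 18]},
                       index=[10, 20, 30])

by_label = df_demo.loc[20, 'name']    # row with index label 20
by_position = df_demo.iloc[1, 0]      # second row, first column (counting from 0)
last_row = df_demo.iloc[-1, 0]        # negative positions count from the end

# loc treats -1 as a label, and no row is labelled -1 here
try:
    df_demo.loc[-1]
    label_missing = False
except KeyError:
    label_missing = True

print(by_label, by_position, last_row, label_missing)
```

With the default index of a freshly read CSV, labels and positions happen to coincide, which is why the two accessors often appear interchangeable.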

In [None]:
subset = df.loc[:, ['name','age']]


In [None]:
print(subset.head())

In [None]:

subset = df.iloc[:, [2, 4, -1]]

In [None]:

print(subset.head())



# You can create a DataFrame in Python with the following steps:


### Create a dictionary of data. A dictionary is a data structure that stores data as a key:value pair.

In [None]:
data={'Name':['Paul','Bob','Susan','Yolanda'], 'Age':[23,45,18,21]}

In [None]:
df=pd.DataFrame(data)

In [None]:
df.to_csv('fromdf.CSV',index=False)

In [None]:
print(df)

In [None]:
dff = pd.read_csv('fromdf.CSV')

In [None]:
print(dff)

You will now have a CSV file with the contents of the <font color= red> DataFrame </font>. How to use the
contents of this DataFrame for executing SQL queries will be covered in the next chapter.
DataFrames will become an important tool in your toolbox, and the rest of the book will lean on
them heavily.










# Writing JSON with Python


Another common data format you will probably deal with is **JavaScript Object Notation
<font color= red> (JSON) </font>**. You will see JSON most often when making calls to Application Programming
Interfaces (APIs); however, it can exist as a file as well. How you handle the data is very
similar whether you read it from a file or an API. Python, as you learned with
CSV, has a standard library module for handling JSON data, not surprisingly named json.
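
A minimal sketch of the round trip the json module provides, using a made-up record. `json.dumps` and `json.loads` work on strings; the file-based `json.dump` and `json.load` behave the same way on open file objects.

```python
import json

# Made-up record, used only for this illustration
record = {'name': 'Bob Smith', 'age': 40, 'lat': 35.1}

text = json.dumps(record)      # Python object -> JSON string
restored = json.loads(text)    # JSON string -> Python object

print(type(text).__name__, restored == record)
```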

In [None]:
# Import the library and open the file you will write to. You also create the Faker object:

from faker import Faker
import json

output=open('data.JSON','w')

fake=Faker()

In [None]:
alldata={}
alldata['records']=[]

In [None]:
for x in range(1000):
    data={"name":fake.name(),"age":fake.random_int(min=18, max=80, step=1), 
          "street":fake.street_address(), "city":fake.city(),
          "state":fake.state(), "zip":fake.zipcode(), "lng":float(fake.longitude()), 
          "lat":float(fake.latitude())}

    alldata['records'].append(data)

In [None]:
json.dump(alldata,output)
output.close()  # flush the data to disk before the file is read back

In [None]:
print(alldata)

In [None]:
with open('data.JSON','r') as f:
    data=json.load(f)

# Show one record to confirm the file was read back correctly
print(data['records'][100])
    
    

In [None]:
df=pd.read_json('data.JSON')

In [None]:
df
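
Because every record in this file sits under a single `records` key, `read_json` should yield one `records` column holding dictionaries rather than one column per field. One way to flatten such a structure is `pd.json_normalize`. The sketch below assumes pandas 1.0 or later and uses a small made-up dictionary in place of the parsed `data.JSON`.

```python
import pandas as pd

# Made-up stand-in for the parsed contents of data.JSON
parsed = {'records': [
    {'name': 'Bob Smith', 'age': 40, 'city': 'Springfield'},
    {'name': 'Larry Smith', 'age': 35, 'city': 'Shelbyville'},
]}

# One column per key, one row per record
flat = pd.json_normalize(parsed, record_path='records')

print(flat.columns.tolist(), flat.shape)
```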