## Working with CSV Files

CSV stands for ***comma-separated values***, a common export format for spreadsheet programs. Organizations use these files to transmit and store data. They can be created in Notepad, Excel, and similar programs.

### Example (text file with data separated by commas)
<img src = "../img/csvexample.png"
     height= "400px"
width= "720px">

### Example (same text file when extension changed to .csv)
<img src = "../img/csvexample1.png"
     height= "400px"
width= "720px">

NB: 
- All CSV files are plain text, meaning they can be opened and read with any text editor or word processor.
- They contain alphanumeric characters.
- They consist of lines of text, with each line representing a row of data in the table. Within each line, the values are separated by commas (hence the name "comma-separated").
- While CSV files can be opened and edited in programs like Excel, they are not the same format. CSV files do not have the advanced formatting capabilities of Excel files, such as font styles, colors, formulas, or data types. This means that in a CSV file, all values are stored as text ***(there is no inherent distinction between numbers, dates, or other data types within a CSV file)***. A program such as Excel may guess a type for each value when it opens the file, but that guess is not stored in the CSV itself.

---

Python has a built-in `csv` module that allows us to grab columns, rows, and values from a .csv file, as well as write to it.

There are other libraries to consider for working with CSV files:
1. Pandas
    - It is a full data analysis library
    - Runs visualizations and analysis
2. Openpyxl
    - designed specifically for Excel files
3. Google Sheets Python API

### Let's explore Python's built-in library


In [3]:
# IMPORT CSV
import csv

## Reading CSV Files

In [31]:
# OPEN THE FILE
csv_file = open("example.csv")

In [32]:
# CALL csv.reader on the file
csv_reader = csv.reader(csv_file)

In [33]:
# REFORMAT it into a python object using list( )
data_lines = list(csv_reader)

#### Sometimes after running the previous line of code you might get an error.
<img src = "../img/unicodeerror.png"
     height= "400px"
width= "720px">

The error message means that there is a problem when trying to convert a sequence of bytes into readable text. The specific issue is related to a character in the byte sequence that cannot be understood by the default decoding method used on Windows systems.

To fix this issue, you add an ***encoding argument*** when opening the file and specify an encoding.

If you do not get this error, it just means your computer's default encoding matches the file's encoding.
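A minimal sketch of the situation described above, using a temporary file rather than `example.csv`. The file name, the sample rows, and the choice of `ascii` as the mismatched encoding are assumptions for illustration; the default encoding on your own machine may differ.

```python
import csv
import os
import tempfile

# Hypothetical sample data containing a non-ASCII character (ã).
sample_rows = [["id", "city"], ["1", "São Paulo"]]

# Write the sample to a temporary file, encoded as UTF-8.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(sample_path, mode="w", newline="", encoding="utf-8") as fh:
    csv.writer(fh).writerows(sample_rows)

# Reading with an encoding that cannot decode those bytes fails.
try:
    with open(sample_path, newline="", encoding="ascii") as fh:
        list(csv.reader(fh))
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print("Decoding failed:", err.reason)

# Reading with the encoding the file was written in succeeds.
with open(sample_path, newline="", encoding="utf-8") as fh:
    rows_read = list(csv.reader(fh))

print(rows_read)
```

The `with` statement is used here so that each file is closed automatically, even when the read raises an error.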

##### Let's try this again....

In [79]:
# OPEN THE FILE (THIS TIME WE ADD AN ENCODING ARGUMENT)
csv_file = open("example.csv", encoding = "utf-8")

In [80]:
# CALL csv.reader on the file
csv_reader = csv.reader(csv_file)

In [82]:
csv_reader

<_csv.reader at 0x7fd272353660>

In [83]:
# REFORMAT it into a python object using list ( )
data_lines = list(csv_reader)

In [85]:
# print(data_lines)

When we check the result of ***data_lines***, we get a ***list of lists*** (a data structure in which each element of the main list is itself a separate list).

- The first item is the column names
- The second and remaining items are the data rows
<img src = "../img/data_lines.png"
     height= "400px"
width= "720px">

In [86]:
# Checking column names
data_lines[0]

['id', 'first_name', 'last_name', 'email', 'gender', 'ip_address', 'city']

In [87]:
# Checking the number of rows
# len() counts every item, including the header row,
# so the number of data rows is len(data_lines) - 1
len(data_lines)

1001

In [88]:
# To get a nice format (use a for loop)
# I used index slicing here since the list is too long
for line in data_lines[:5]:
    print(line)

['id', 'first_name', 'last_name', 'email', 'gender', 'ip_address', 'city']
['1', 'Joseph', 'Zaniolini', 'jzaniolini0@simplemachines.org', 'Male', '163.168.68.132', 'Pedro Leopoldo']
['2', 'Freida', 'Drillingcourt', 'fdrillingcourt1@umich.edu', 'Female', '97.212.102.79', 'Buri']
['3', 'Nanni', 'Herity', 'nherity2@statcounter.com', 'Female', '145.151.178.98', 'Claver']
['4', 'Orazio', 'Frayling', 'ofrayling3@economist.com', 'Male', '25.199.143.143', 'Kungur']


### We can extract any item in the csv file 
#### we can extract....
- a row
- an item in row
- an item in a column
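The three kinds of extraction listed above can be sketched as follows. To keep the example self-contained it reads a few of the rows shown earlier from an in-memory string (via `io.StringIO`) instead of `example.csv`; the variable names are illustrative assumptions.

```python
import csv
import io

# A few rows copied from the output shown earlier.
sample = """id,first_name,last_name,email,gender,ip_address,city
1,Joseph,Zaniolini,jzaniolini0@simplemachines.org,Male,163.168.68.132,Pedro Leopoldo
2,Freida,Drillingcourt,fdrillingcourt1@umich.edu,Female,97.212.102.79,Buri
3,Nanni,Herity,nherity2@statcounter.com,Female,145.151.178.98,Claver
"""

sample_lines = list(csv.reader(io.StringIO(sample)))

# A row: index the outer list (index 0 is the header row).
second_row = sample_lines[2]

# An item in a row: index the row again (index 3 is the email).
second_email = sample_lines[2][3]

# An item in a column: take the same position from every data row.
all_emails = [line[3] for line in sample_lines[1:]]

# Every value is read as a string, so convert explicitly when needed.
first_id = int(sample_lines[1][0])

print(second_row)
print(second_email)
print(all_emails)
print(first_id + 1)
```

Slicing from index 1 (`sample_lines[1:]`) skips the header so that only data rows contribute to the column.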

## Writing a CSV file

In [74]:
f = open("first_csv_file",mode = "w", newline = "")

#### This creates the file if the name does not already exist in the directory. It will overwrite any existing file with the same name.

<img src = "../img/firstcsvfile.png"
     height= "400px"
width= "720px">

You may have noticed the newline argument. The ***newline parameter*** of `open()` controls how line endings are handled when reading or writing text files.

NB: It applies to text mode only (it is not applicable in binary mode).


- If "newline" is None (the default), Python converts the different line ending formats ('\n', '\r', or '\r\n') into a single '\n' character when reading. When writing, each '\n' is converted to the operating system's default line ending.
- If "newline" is an empty string (''), lines are still recognized by any of those endings when reading, but the endings are returned exactly as they are in the file. When writing, no conversion takes place.
- If "newline" has any other value (e.g., '\r' or '\r\n'), Python will consider lines terminated only by that specific string when reading, and the line endings are returned exactly as they are in the input file. When writing, each '\n' is converted to that string.

By explicitly setting `newline=''` when opening the file, you let the csv module control line termination itself when writing rows with `csv_writer.writerow()`.
This matters for compatibility across operating systems: without it, on platforms that use '\r\n' line endings (such as Windows), an extra '\r' is added to each row, which typically shows up as blank lines between rows.
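As a small illustration of the point above, the sketch below writes one row with `newline=''` and then reads the file back in binary mode to show the exact bytes on disk. The temporary file and the names `demo_path`, `demo_file`, and `demo_writer` are assumptions made for this example only.

```python
import csv
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "newline_demo.csv")

# newline="" tells Python not to translate line endings on write,
# so the csv writer's default "\r\n" terminator reaches the file unchanged.
with open(demo_path, mode="w", newline="") as demo_file:
    demo_writer = csv.writer(demo_file)
    demo_writer.writerow(["Name", "Age", "Class"])

# Read the raw bytes back to see exactly what was written.
with open(demo_path, mode="rb") as demo_file:
    raw = demo_file.read()

print(raw)
```

The row ends with a single `\r\n`, which is the line terminator that `csv.writer` uses by default.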

In [75]:
# Call csv.writer(object) on the file
csv_writer = csv.writer(f)

In [76]:
# writing a single row
csv_writer.writerow(["Name","Age","Class"])

16

In [77]:
# writing multiple rows (This takes in a list of lists)
# The number of items in each row should be consistent
csv_writer.writerows([["Selorm",17,11],["Jerome",19,13]])

In [78]:
f.close()

### You should see this when you open the file
<img src = "../img/firstcsvfile1.png"
     height= "400px"
width= "720px">

## Working with PDF files

PDF stands for Portable Document Format, which was developed by Adobe.

Most PDFs that contain scanned documents or images are not machine-readable. Scanned PDFs are essentially images or pictures of documents, and they lack structured text data that can be easily extracted and processed by Python.

We are going to use a free, open-source library, ***PyPDF2***, to read PDFs and extract their text.

NB: remember that scanned PDFs are usually not machine readable

### Working with PyPDF2

The first thing you need to do is install PyPDF2 from your command line using `pip install PyPDF2`


### Reading PDFs

In [5]:
# import the PyPDF2 library
import PyPDF2

In [6]:
# Open PDF file
f = open("Working_Business_Proposal.pdf", mode = "rb")

### Why 'rb' mode is necessary
A PDF file is not plain text, so it needs to be opened in binary mode. It contains bytes that can't be interpreted as plain text.
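A rough sketch of why text mode is a poor fit, using only the standard library. The bytes below are a hypothetical stand-in that imitates the first lines of a PDF; they do not form a valid PDF and are used only to show the difference between the two modes.

```python
import os
import tempfile

# Hypothetical bytes imitating the start of a PDF: a text-like header
# followed by raw binary values that are not valid UTF-8.
fake_pdf_bytes = b"%PDF-1.4\n\xe2\xe3\xcf\xd3\n"

fake_path = os.path.join(tempfile.mkdtemp(), "fake.pdf")
with open(fake_path, mode="wb") as fh:
    fh.write(fake_pdf_bytes)

# Binary mode returns the bytes untouched.
with open(fake_path, mode="rb") as fh:
    raw = fh.read()

# Text mode tries to decode the bytes into characters and can fail.
try:
    with open(fake_path, mode="r", encoding="utf-8") as fh:
        fh.read()
    text_mode_failed = False
except UnicodeDecodeError:
    text_mode_failed = True

print(raw[:8])
print(text_mode_failed)
```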

In [8]:
# Call PyPDF2.PdfReader( file_object ) on the open file
pdf_reader = PyPDF2.PdfReader(f)

In [12]:
# checking number of pages in the pdf
len(pdf_reader.pages)

5

In [17]:
# getting a specific page
page_one = pdf_reader.pages[0]

In [21]:
# getting text from page one
page_one_text = page_one.extract_text()

In [22]:
page_one_text

'Business Proposal The Revolution is Coming Leverage agile frameworks to provide a robust synopsis for high level overviews. Iterative approaches to corporate strategy foster collaborative thinking to further the overall value proposition. Organically grow the holistic world view of disruptive innovation via workplace diversity and empowerment. Bring to the table win-win survival strategies to ensure proactive domination. At the end of the day, going forward, a new normal that has evolved from generation X is on the runway heading towards a streamlined cloud solution. User generated content in real-time will have multiple touchpoints for offshoring. Capitalize on low hanging fruit to identify a ballpark value added activity to beta test. Override the digital divide with additional clickthroughs from DevOps. Nanotechnology immersion along the information highway will close the loop on focusing solely on the bottom line. Podcasting operational change management inside of workﬂows to esta

In [23]:
f.close()

### Adding to PDFs

### Links:

### CSV
https://www.educba.com/csv-files-into-excel/

### Unicode
https://www.twilio.com/docs/glossary/what-is-unicode

#### Character Encoding
https://www.motionpoint.com/blog/the-importance-of-character-encoding-website-translation-user-experience/#:~:text=What%20is%20Character%20Encoding%3F,a%20letter%2C%20number%20or%20symbol.

### Misc:

#### Character Encoding

***Unicode*** is an international character encoding standard that provides a unique number for every character across languages and scripts, making almost all characters accessible across platforms, programs, and devices.

***Character encoding*** is the process of representing individual characters (especially the written characters of human language) using a corresponding encoding system made up of other symbols and types of data.

- It tells computers how to interpret digital data into letters, numbers and symbols. This is done by assigning a specific numeric value to a letter, number or symbol.

- It is a way to convert text data into binary numbers.

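A ***codec*** (short for coder/decoder) is a component that encodes and decodes data. In hardware, it can be a chip that converts between analog and digital signals. In Python, a codec such as `utf-8` converts between text and bytes.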
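The ideas above can be tried directly in Python. This is a minimal sketch; the word `café` and the two codecs compared (`utf-8` and `latin-1`) are arbitrary choices for illustration.

```python
text = "café"

# Each character has a Unicode code point (a number).
code_points = [ord(ch) for ch in text]

# Encoding turns the text into bytes using a chosen codec.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Decoding with the matching codec recovers the original text.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with a different codec gives the wrong characters.
mismatched = utf8_bytes.decode("latin-1")

print(code_points)
print(utf8_bytes)
print(latin1_bytes)
print(round_trip)
print(mismatched)
```

The same text produces different bytes under different codecs, which is why a file must be decoded with the encoding it was written in.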