# Data Programming in Python | BAIS:6040
# Module 6. Handling Files

Written by Kang-Pyo Lee 

Topics to be covered:
- Useful operating system functionality
- Writing and reading a file (+ exercises)
- Writing and reading a CSV file using Pandas
- Writing and reading a JSON file (+ exercises)

## Useful Operating System-Dependent Functions for Files

In [None]:
import os

The <b>os</b> module provides a portable way of using operating-system-dependent functionality for handling files and directories.

In [None]:
os.listdir("./")                        # The path "./" means the current directory.

The <b>os.listdir</b>(path='.') function returns a list containing the names of the entries in the directory given by `path`.

In [None]:
[item for item in os.listdir("./") if item.endswith(".ipynb")]     # a list of ipynb files in the current directory

In [None]:
os.getcwd()

The <b>os.getcwd</b> function returns a string representing the current working directory.

In [None]:
os.path.isfile("Module6_Files.ipynb")

The <b>os.path.isfile</b>(path) function returns True if `path` is an existing regular file. 

In [None]:
os.path.isdir("classdata")

The <b>os.path.isdir</b>(path) function returns True if `path` is an existing directory.

In [None]:
if not os.path.isdir("outcome"):         # Check if there is an existing directory named outcome.
    os.mkdir("outcome")                  # Create a new directory named outcome.

The <b>os.mkdir</b>(path) function creates a directory named `path`.



We are going to save all of our outcome files in the `outcome` directory. 
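As a variation on the check-then-create pattern above, the sketch below uses `os.makedirs` with `exist_ok=True`, which creates the directory only if it is missing, and `os.path.join`, which builds a path with the separator appropriate for the operating system. The temporary base directory and the file name `output.txt` are assumptions made only to keep the example self-contained.

```python
import os
import tempfile

# A throwaway base directory so the example does not touch your working directory.
base_dir = tempfile.mkdtemp()

# Build the path portably instead of hard-coding a separator.
outcome_dir = os.path.join(base_dir, "outcome")

# Create the directory; exist_ok=True makes the second call harmless.
os.makedirs(outcome_dir, exist_ok=True)
os.makedirs(outcome_dir, exist_ok=True)

output_path = os.path.join(outcome_dir, "output.txt")
print(os.path.isdir(outcome_dir))     # True
print(os.path.isfile(output_path))    # False, since nothing has been written yet
```

Without `exist_ok=True`, calling `os.makedirs` or `os.mkdir` on an existing directory raises a FileExistsError, which is why the cell above checks with `os.path.isdir` first.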

<hr>

## Writing a File

When writing and reading a file, the first thing you need to do is to open a file in the right mode. 

In [None]:
fw = open("outcome/output.txt", mode="w")

open: https://docs.python.org/3/library/functions.html#open

The built-in <b>open</b> function opens the file provided and returns a corresponding file object. The parameter `mode` is an optional string that specifies the mode in which the file is opened. 
- "r": opens a file for reading. (default)
- "w": opens a file for writing. Creates a new file if it does not exist, or truncates the file if it exists.
- "a": opens for appending at the end of the file without truncating it. Creates a new file if it does not exist.
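- "b": opens in binary mode. It is combined with another mode, e.g., "rb" or "wb".
- "+": opens a file for updating (reading and writing). It is combined with another mode, e.g., "r+".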

If the file cannot be opened, an OSError is raised. 
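To make the difference between the modes concrete, here is a minimal sketch that writes, overwrites, appends to, and then reads a small file, and that catches the error raised when a file to be read does not exist. The file names `modes_demo.txt` and `missing.txt` and the temporary directory are hypothetical.

```python
import os
import tempfile

demo_dir = tempfile.mkdtemp()
path = os.path.join(demo_dir, "modes_demo.txt")

fw = open(path, mode="w")             # "w" creates the file.
fw.write("first\n")
fw.close()

fw = open(path, mode="w")             # Opening with "w" again discards "first".
fw.write("second\n")
fw.close()

fa = open(path, mode="a")             # "a" keeps the content and adds to the end.
fa.write("third\n")
fa.close()

fr = open(path, mode="r")
content = fr.read()
fr.close()
print(content, end="")                # Prints "second" and then "third".

# Reading a file that does not exist raises FileNotFoundError,
# which is a subclass of OSError.
try:
    open(os.path.join(demo_dir, "missing.txt"), mode="r")
    error_name = None
except OSError as e:
    error_name = type(e).__name__
print(error_name)                     # FileNotFoundError
```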

In [None]:
fw.write("Hello, world!\n")

The <b>write</b> method writes the given string to the file and returns the number of characters written. 

In [None]:
print("Hello, world!\n", end="")

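Note that writing a string to a file with the <b>write</b> method is much like printing a string to the screen with the <b>print</b> function. The difference is where the string goes. Unlike <b>print</b>, <b>write</b> does not add a new line at the end, which is why the cell above passes `end=""`. 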

In [None]:
fw.close()

The <b>close</b> method closes the file and immediately frees up any system resources used by it. Make sure to close the file object once you're done with it. 

After writing something to a file, always make sure to manually open the file to see if everything has been written as expected.  

In [None]:
with open("outcome/output.txt", mode="w") as fw:
    fw.write("Hello, world!\n")

It is good practice to use the <b>with</b> keyword when dealing with file objects. The advantage is that the file is properly closed after its nested block finishes, without having to explicitly close the file with the <b>close</b> method.
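One way to see this for yourself is to inspect the `closed` attribute of the file object, which reports whether the file has been closed. The sketch below assumes a hypothetical file name, `with_demo.txt`, in a temporary directory.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

with open(path, mode="w") as fw:
    fw.write("Hello, world!\n")
    inside = fw.closed        # False: the file is still open inside the block.

outside = fw.closed           # True: the file was closed when the block ended.
print(inside, outside)
```

The file is closed in the same way even if an exception is raised inside the block.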

In [None]:
with open("outcome/output.csv", "w") as fw:
    # Write the header row.
    fw.write("num\n")                # Use a new line (\n) between rows.
    
    # Write the value rows.
    for i in range(100):
        fw.write("{}\n".format(i))

When writing a CSV file with a single column, you need to choose a delimiter that marks the boundary between rows, e.g., a new line ("\n").

In [None]:
with open("outcome/output2.csv", "w") as fw:
    # Write the header row.
    fw.write("num,col1,col2\n")       # Use a comma (,) between columns and a new line (\n) between rows.
    
    # Write the value rows.
    for i in range(100):
        fw.write("{},{},{}\n".format(i, i*10, i*100))

When writing a CSV file with multiple columns, you also need to choose a delimiter that marks the boundary between columns, e.g., a comma (",") or a tab ("\t").
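A delimiter written by hand breaks down when a value itself contains that delimiter. As a hedged alternative to formatting each line manually, the sketch below uses the `csv` module from the standard library, which quotes such values automatically. The column names and rows are made up for illustration, and the output goes to an in-memory buffer rather than a file.

```python
import csv
import io

rows = [
    ["name", "city"],
    ["Ann", "Iowa City, IA"],     # This value contains a comma.
    ["Bob", "Chicago"],
]

buffer = io.StringIO()            # Stands in for a file opened with mode="w".
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)

text = buffer.getvalue()
print(text, end="")
# name,city
# Ann,"Iowa City, IA"
# Bob,Chicago
```

When writing to a real file with `csv.writer`, the `csv` documentation recommends opening it with `newline=""`, e.g., `open(path, "w", newline="")`.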

In [None]:
from seaborn import load_dataset

df = load_dataset("titanic")
df

Let's write part of the data in a dataframe to a file. Specifically, we want to read the values of three columns in the Titanic dataframe (survived, pclass, and fare) row by row and write them, together with the row index, to a CSV file. 

In [None]:
with open("outcome/my_titanic.csv", "w") as fw:
    # Write the header row.
    header = "index,survived,pclass,fare\n"
    fw.write(header)
    print(header, end="")            # Print the header row just to check the current status.
    
    # Write the value rows.
    for idx, row in df.iterrows():
        survived = row.survived
        pclass = row.pclass
        fare = row.fare
        
        line = "{},{},{},{}\n".format(idx, survived, pclass, fare)
        fw.write(line)
        print(line, end="")           # Print each row to check the current status.

<hr>

## Reading a File

In [None]:
with open("outcome/output.txt", "r") as fr:
    content = fr.read()                    # Read the whole content in the file.
    print(content) 

In [None]:
with open("outcome/output2.csv", "r") as fr:
    for line in fr:                        # Read the file line by line. 
        print(line, end="")

In [None]:
with open("outcome/output2.csv", "r") as fr:
    lines = fr.readlines()                 # Read the whole content in the file as a list of lines.
                                           # Not recommended if the file is too large to be loaded in memory.
    for line in lines:
        print(line, end="")

In [None]:
with open("outcome/output2.csv", "r") as fr:
    lines = fr.readlines()                 # Read the whole content in the file as a list of lines.
    
    # Decompose the header row into column names
    header = lines[0]
    header = header.rstrip()               # Remove the trailing new line in the header.
    num, col1, col2 = header.split(",")
    
    # Decompose each line into values
    for line in lines[1:]:                 # Start from the second row.
        line = line.rstrip()               # Remove the trailing new line in each line.
        num_val, col1_val, col2_val = line.split(",")
        print("{}: {}, {}: {}, {}: {}".format(num, num_val, col1, col1_val, col2, col2_val))
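Splitting on commas by hand works for simple files like this one. As a sketch of an alternative, `csv.DictReader` from the standard library uses the header row as keys and returns each row as a dictionary. The sample text below mimics the first rows of `output2.csv` and is held in memory so that the example runs on its own.

```python
import csv
import io

sample = "num,col1,col2\n0,0,0\n1,10,100\n2,20,200\n"

reader = csv.DictReader(io.StringIO(sample))   # Stands in for an open file object.
records = list(reader)

for record in records:
    # Every value is read as a string; convert with int() if you need numbers.
    print("num: {}, col1: {}, col2: {}".format(record["num"], record["col1"], record["col2"]))

total = sum(int(record["col2"]) for record in records)
print(total)                                   # 300
```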

In [None]:
open("outcome/outputtt.csv", "r")         # This file does not exist, so a FileNotFoundError (a subclass of OSError) is raised.

## Exercises for File Writing and Reading

<hr>

## Reading and Writing a CSV File Using Pandas

In [None]:
import pandas as pd     

In [None]:
df = pd.read_csv("outcome/ex_titanic2.csv")
df

pandas.read_csv: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

When you need to read a CSV file and analyze the content in a tabular format with rows and columns, it is a good idea to read the file into a Pandas dataframe. 

In [None]:
df = pd.read_csv("outcome/ex_titanic2.csv", sep="\t")
df

In [None]:
df = pd.read_csv("https://people.sc.fsu.edu/~jburkardt/data/csv/biostats.csv")
df

In [None]:
# ! pip install --user openpyxl xlrd xlsxwriter 

In [None]:
df = pd.read_excel("http://go.microsoft.com/fwlink/?LinkID=521962", sheet_name="Sheet1")
df

pandas.read_excel: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

In [None]:
df.to_csv("outcome/my_data.csv", sep=",", index=False)

pandas.DataFrame.to_csv: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html

In [None]:
df.to_excel("outcome/my_data.xlsx", sheet_name="Sheet1", index=False, engine='xlsxwriter')    # xlsxwriter produces files in the .xlsx format.

pandas.DataFrame.to_excel: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html

<hr>

## Writing and Reading a JSON File

JSON, which stands for JavaScript Object Notation, is one of the most commonly used formats for data transfer. It is popular because it is clean, easy to read, and easy to parse. Many websites provide JSON-enabled APIs (Application Programming Interfaces). 

In [None]:
import json

In [None]:
states = {"IL": "Illinois", "WI": "Wisconsin", "IA": "Iowa", "NE": "Nebraska", "MN": "Minnesota"}
states

In [None]:
with open("outcome/my_states.json", "w") as fw:
    json.dump(states, fw)

The <b>json.dump</b>(obj, fp, ...) function serializes `obj` as a JSON formatted stream to `fp`.

Converting a Python object into a format that can be stored in a file or sent over a network is called serialization. If you open the JSON file manually, you will see that its content looks much like a Python dictionary. 
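The resemblance is close but not exact. The sketch below uses `json.dumps`, which returns the JSON text as a string instead of writing it to a file, to show how a few Python values are written in JSON. The dictionary `record` is a made-up example.

```python
import json

record = {"name": "Iowa", "verified": True, "url": None, "count": 3}

text = json.dumps(record)
print(text)
# {"name": "Iowa", "verified": true, "url": null, "count": 3}
```

Strings are always written with double quotes, `True` becomes `true`, and `None` becomes `null`.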

In [None]:
with open("outcome/my_states.json", "r") as fr:
    states_new = json.load(fr)

The <b>json.load</b>(fp, ...) function deserializes `fp` to a Python object. 

Reading a file back into a Python object is called deserialization. 
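A round trip through JSON usually returns an equal object, with some exceptions worth knowing. In the sketch below, which uses made-up data and the string-based functions `json.dumps` and `json.loads`, an integer dictionary key comes back as a string and a tuple comes back as a list.

```python
import json

original = {"states": ("IA", "IL"), 1: "one"}

restored = json.loads(json.dumps(original))
print(restored)
# {'states': ['IA', 'IL'], '1': 'one'}

print(restored == original)       # False
```

A dictionary with string keys and list values, such as `states` above, survives the round trip unchanged.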

In [None]:
states_new

In [None]:
status = {'created_at': 'Mon Oct 14 03:07:35 +0000 2019',
 'id': 1183580078518743041,
 'id_str': '1183580078518743041',
 'text': '@Charalanahzard Those rumors are false',
 'truncated': False,
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [{'screen_name': 'Charalanahzard',
    'name': 'Alanah Pearce',
    'id': 96997907,
    'id_str': '96997907',
    'indices': [0, 15]}],
  'urls': []},
 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 'in_reply_to_status_id': 1183499365077225472,
 'in_reply_to_status_id_str': '1183499365077225472',
 'in_reply_to_user_id': 96997907,
 'in_reply_to_user_id_str': '96997907',
 'in_reply_to_screen_name': 'Charalanahzard',
 'user': {'id': 44196397,
  'id_str': '44196397',
  'name': 'Elon Musk',
  'screen_name': 'elonmusk',
  'location': '',
  'description': '',
  'url': None,
  'entities': {'description': {'urls': []}},
  'protected': False,
  'followers_count': 28672988,
  'friends_count': 80,
  'listed_count': 51716,
  'created_at': 'Tue Jun 02 20:12:29 +0000 2009',
  'favourites_count': 4062,
  'utc_offset': None,
  'time_zone': None,
  'geo_enabled': False,
  'verified': True,
  'statuses_count': 9065,
  'lang': None,
  'contributors_enabled': False,
  'is_translator': False,
  'is_translation_enabled': False,
  'profile_background_color': 'C0DEED',
  'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
  'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
  'profile_background_tile': False,
  'profile_image_url': 'http://pbs.twimg.com/profile_images/1178009465674747907/k5fzT65B_normal.jpg',
  'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1178009465674747907/k5fzT65B_normal.jpg',
  'profile_banner_url': 'https://pbs.twimg.com/profile_banners/44196397/1556675519',
  'profile_link_color': '0084B4',
  'profile_sidebar_border_color': 'C0DEED',
  'profile_sidebar_fill_color': 'DDEEF6',
  'profile_text_color': '333333',
  'profile_use_background_image': True,
  'has_extended_profile': True,
  'default_profile': False,
  'default_profile_image': False,
  'following': True,
  'follow_request_sent': False,
  'notifications': False,
  'translator_type': 'none'},
 'geo': None,
 'coordinates': None,
 'place': None,
 'contributors': None,
 'is_quote_status': False,
 'retweet_count': 413,
 'favorite_count': 10413,
 'favorited': False,
 'retweeted': False,
 'lang': 'en'}

Tweet Object: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object

In [None]:
type(status)

In [None]:
status2 = {'created_at': 'Sun Oct 13 22:28:12 +0000 2019',
 'id': 1183509770382135297,
 'id_str': '1183509770382135297',
 'text': '@TeslaGong Coming v soon',
 'truncated': False,
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [{'screen_name': 'TeslaGong',
    'name': 'Tesla in the Gong ( I am a Tesla 🤖 )',
    'id': 1008296232261783552,
    'id_str': '1008296232261783552',
    'indices': [0, 10]}],
  'urls': []},
 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 'in_reply_to_status_id': 1183509540844605440,
 'in_reply_to_status_id_str': '1183509540844605440',
 'in_reply_to_user_id': 1008296232261783552,
 'in_reply_to_user_id_str': '1008296232261783552',
 'in_reply_to_screen_name': 'TeslaGong',
 'user': {'id': 44196397,
  'id_str': '44196397',
  'name': 'Elon Musk',
  'screen_name': 'elonmusk',
  'location': '',
  'description': '',
  'url': None,
  'entities': {'description': {'urls': []}},
  'protected': False,
  'followers_count': 28672988,
  'friends_count': 80,
  'listed_count': 51716,
  'created_at': 'Tue Jun 02 20:12:29 +0000 2009',
  'favourites_count': 4062,
  'utc_offset': None,
  'time_zone': None,
  'geo_enabled': False,
  'verified': True,
  'statuses_count': 9065,
  'lang': None,
  'contributors_enabled': False,
  'is_translator': False,
  'is_translation_enabled': False,
  'profile_background_color': 'C0DEED',
  'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
  'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
  'profile_background_tile': False,
  'profile_image_url': 'http://pbs.twimg.com/profile_images/1178009465674747907/k5fzT65B_normal.jpg',
  'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1178009465674747907/k5fzT65B_normal.jpg',
  'profile_banner_url': 'https://pbs.twimg.com/profile_banners/44196397/1556675519',
  'profile_link_color': '0084B4',
  'profile_sidebar_border_color': 'C0DEED',
  'profile_sidebar_fill_color': 'DDEEF6',
  'profile_text_color': '333333',
  'profile_use_background_image': True,
  'has_extended_profile': True,
  'default_profile': False,
  'default_profile_image': False,
  'following': True,
  'follow_request_sent': False,
  'notifications': False,
  'translator_type': 'none'},
 'geo': None,
 'coordinates': None,
 'place': None,
 'contributors': None,
 'is_quote_status': False,
 'retweet_count': 28,
 'favorite_count': 513,
 'favorited': False,
 'retweeted': False,
 'lang': 'en'}

In [None]:
status3 = {'created_at': 'Sun Oct 13 09:27:47 +0000 2019',
 'id': 1183313372885839873,
 'id_str': '1183313372885839873',
 'text': 'Space Jam should’ve won the Oscar https://t.co/E7l2DCAxDH',
 'truncated': False,
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [],
  'urls': [],
  'media': [{'id': 1183313363851341824,
    'id_str': '1183313363851341824',
    'indices': [34, 57],
    'media_url': 'http://pbs.twimg.com/media/EGv5PCZU8AA7PZJ.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/EGv5PCZU8AA7PZJ.jpg',
    'url': 'https://t.co/E7l2DCAxDH',
    'display_url': 'pic.twitter.com/E7l2DCAxDH',
    'expanded_url': 'https://twitter.com/elonmusk/status/1183313372885839873/photo/1',
    'type': 'photo',
    'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
     'medium': {'w': 1023, 'h': 529, 'resize': 'fit'},
     'large': {'w': 1023, 'h': 529, 'resize': 'fit'},
     'small': {'w': 680, 'h': 352, 'resize': 'fit'}}}]},
 'extended_entities': {'media': [{'id': 1183313363851341824,
    'id_str': '1183313363851341824',
    'indices': [34, 57],
    'media_url': 'http://pbs.twimg.com/media/EGv5PCZU8AA7PZJ.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/EGv5PCZU8AA7PZJ.jpg',
    'url': 'https://t.co/E7l2DCAxDH',
    'display_url': 'pic.twitter.com/E7l2DCAxDH',
    'expanded_url': 'https://twitter.com/elonmusk/status/1183313372885839873/photo/1',
    'type': 'photo',
    'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
     'medium': {'w': 1023, 'h': 529, 'resize': 'fit'},
     'large': {'w': 1023, 'h': 529, 'resize': 'fit'},
     'small': {'w': 680, 'h': 352, 'resize': 'fit'}}}]},
 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 44196397,
  'id_str': '44196397',
  'name': 'Elon Musk',
  'screen_name': 'elonmusk',
  'location': '',
  'description': '',
  'url': None,
  'entities': {'description': {'urls': []}},
  'protected': False,
  'followers_count': 28672988,
  'friends_count': 80,
  'listed_count': 51716,
  'created_at': 'Tue Jun 02 20:12:29 +0000 2009',
  'favourites_count': 4062,
  'utc_offset': None,
  'time_zone': None,
  'geo_enabled': False,
  'verified': True,
  'statuses_count': 9065,
  'lang': None,
  'contributors_enabled': False,
  'is_translator': False,
  'is_translation_enabled': False,
  'profile_background_color': 'C0DEED',
  'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
  'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
  'profile_background_tile': False,
  'profile_image_url': 'http://pbs.twimg.com/profile_images/1178009465674747907/k5fzT65B_normal.jpg',
  'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1178009465674747907/k5fzT65B_normal.jpg',
  'profile_banner_url': 'https://pbs.twimg.com/profile_banners/44196397/1556675519',
  'profile_link_color': '0084B4',
  'profile_sidebar_border_color': 'C0DEED',
  'profile_sidebar_fill_color': 'DDEEF6',
  'profile_text_color': '333333',
  'profile_use_background_image': True,
  'has_extended_profile': True,
  'default_profile': False,
  'default_profile_image': False,
  'following': True,
  'follow_request_sent': False,
  'notifications': False,
  'translator_type': 'none'},
 'geo': None,
 'coordinates': None,
 'place': None,
 'contributors': None,
 'is_quote_status': False,
 'retweet_count': 13349,
 'favorite_count': 117950,
 'favorited': False,
 'retweeted': False,
 'possibly_sensitive': False,
 'lang': 'en'}

In [None]:
statuses = [status, status2, status3]

In [None]:
type(statuses)

In [None]:
with open("outcome/my_statuses.json", "w") as fw:
    json.dump(statuses, fw)

In [None]:
with open("outcome/my_statuses.json", "r") as fr:
    statuses_new = json.load(fr)

In [None]:
statuses_new[0]

## Exercises for JSON File Writing and Reading