# Saving files in Python with Pandas

| Function | File Format | Description | Key Parameters |
| :--- | :--- | :--- | :--- |
| `df.to_csv()` | **CSV** (Comma Separated Values) | The most common format for plain-text data. Simple, human-readable, and universally compatible. | `path_or_buf`, `sep`, `index`, `header` |
| `df.to_excel()` | **Excel** (`.xlsx`, `.xls`) | Saves data to an Excel workbook. Great for sharing with non-programmers. | `excel_writer`, `sheet_name`, `index` |
| `df.to_parquet()` | **Parquet** (`.parquet`) | A highly efficient, columnar storage format optimized for speed and reduced file size. Ideal for "big data" and analytic pipelines. | `path`, `engine`, `index` |
| `df.to_json()` | **JSON** (JavaScript Object Notation) | Saves data in a structured, hierarchical text format. Common for web applications and APIs. | `path_or_buf`, `orient` |
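
To make a few of the key parameters in the table concrete, here is a minimal sketch on a tiny made-up DataFrame (the values and the names `df`, `csv_text`, `json_text`, `records`, and `df_back` are illustrative assumptions, not part of the notebook). It writes to in-memory text rather than to disk, which is what `to_csv` and `to_json` do when no path is given.

```python
import io
import json

import pandas as pd

# Small illustrative frame (made-up values, not the notebook's download)
df = pd.DataFrame(
    {"Close": [72.47, 71.76], "Volume": [135480400, 146322800]},
    index=pd.Index(["2020-01-02", "2020-01-03"], name="Date"),
)

# With no path, to_csv returns the CSV text instead of writing a file.
# sep changes the delimiter; index=False leaves the row labels out.
csv_text = df.to_csv(sep=";", index=False)

# orient controls the JSON layout: "records" gives a list of row objects
json_text = df.to_json(orient="records")
records = json.loads(json_text)

# Reading the JSON text back recovers the columns (the index was not written)
df_back = pd.read_json(io.StringIO(json_text))

print(csv_text)
print(records)
```

Passing a file path as the first argument instead would write the same text to disk.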

In [1]:
# Import required libraries for data analysis and visualization
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
import yfinance as yf

# Show floats with 2 decimal places so the printed tables are easier to read
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# Define the company ticker symbol
company = "AAPL"

# Define start and end dates for data download
start = dt.datetime(2020, 1, 1) # January 1, 2020 with dt package
end = dt.datetime(2025, 1, 1) # January 1, 2025 with dt package

# Download stock data using yfinance
# The function returns a DataFrame directly
stocks = yf.download(company, start=start, end=end)

# Clean column names by removing the 'Ticker' multi-index level
stocks.columns = stocks.columns.droplevel(1) # remove the 'Ticker' multi index level

# Display the first few rows of the data
stocks.head()

[*********************100%***********************]  1 of 1 completed


Price,Close,High,Low,Open,Volume
Date,,,,,
2020-01-02,72.47,72.53,71.22,71.48,135480400
2020-01-03,71.76,72.52,71.54,71.7,146322800
2020-01-06,72.34,72.37,70.63,70.89,118387200
2020-01-07,72.0,72.6,71.78,72.35,108872000
2020-01-08,73.15,73.46,71.7,71.7,132079200


In [21]:
stocks.columns.name = None  # clear the leftover 'Price' label on the column index
stocks.head()

Date,Close,High,Low,Open,Volume
2020-01-02,72.47,72.53,71.22,71.48,135480400
2020-01-03,71.76,72.52,71.54,71.7,146322800
2020-01-06,72.34,72.37,70.63,70.89,118387200
2020-01-07,72.0,72.6,71.78,72.35,108872000
2020-01-08,73.15,73.46,71.7,71.7,132079200
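
Because the download step needs network access, the column clean-up can be tried offline on a small stand-in. The sketch below assumes the downloaded frame has two column levels named `Price` and `Ticker`, as the outputs above suggest; the frame `demo` and its values are made up for illustration.

```python
import pandas as pd

# Column labels shaped like a two-level header:
# level 0 is named "Price", level 1 is named "Ticker" (assumed layout)
columns = pd.MultiIndex.from_product(
    [["Close", "Open"], ["AAPL"]], names=["Price", "Ticker"]
)
demo = pd.DataFrame(
    [[72.47, 71.48], [71.76, 71.70]],
    index=pd.to_datetime(["2020-01-02", "2020-01-03"]).rename("Date"),
    columns=columns,
)

demo.columns = demo.columns.droplevel(1)  # drop the "Ticker" level
print(demo.columns.name)                  # the remaining level keeps its name

demo.columns.name = None                  # clear the leftover label
print(list(demo.columns))
```

Dropping the level by name, `droplevel("Ticker")`, is an equivalent option that does not depend on the level's position.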



**Use the `df.to_csv` function to save a file**

In [22]:
# Use the monthly resampled data as the example for saving and opening a file
Monthlystock = stocks.resample('ME').agg({
    # Column: Function to apply
    'Close': 'mean',        # Average closing price for the month
    'Open': 'mean',         # Average opening price for the month
    'High': 'mean',         # Average daily high for the month (use 'max' for the monthly high)
    'Low': 'max',           # Largest daily low for the month (use 'min' for the monthly low)
    'Volume': 'sum',        # Total volume traded for the month
})
Monthlystock.head()  # averaging suits the price columns but not volume, so volume is summed

Date,Close,Open,High,Low,Volume
2020-01-31,75.26,75.08,75.84,77.54,2934370400
2020-02-29,75.24,75.01,76.21,78.2,3019279200
2020-03-31,63.47,63.14,65.23,70.89,6280072400
2020-04-30,65.88,65.74,66.7,69.74,3265299200
2020-05-31,75.12,74.86,75.92,76.76,2805936000
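
If the goal is the true monthly high and low rather than an average, one option is to aggregate `High` with `'max'` and `Low` with `'min'`. The sketch below shows this on made-up daily data (the frame `daily` and its values are assumptions for illustration). The `'ME'` alias needs a recent pandas release, so the sketch falls back to the older `'M'` alias if `'ME'` is rejected.

```python
import pandas as pd

# Made-up daily prices spanning two months (not real AAPL data)
daily = pd.DataFrame(
    {
        "High": [10.0, 12.0, 11.0, 15.0],
        "Low": [9.0, 8.0, 10.0, 13.0],
        "Volume": [100, 200, 300, 400],
    },
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-02-03", "2020-02-04"]),
)

# Group the rows by calendar month, labelled with the month-end date
try:
    resampler = daily.resample("ME")
except ValueError:
    resampler = daily.resample("M")  # alias used by older pandas versions

monthly = resampler.agg(
    {
        "High": "max",    # highest daily high in the month
        "Low": "min",     # lowest daily low in the month
        "Volume": "sum",  # total volume traded in the month
    }
)
print(monthly)
```

With these aggregations the monthly `Low` can never exceed the monthly `High`, which is a quick sanity check on the result.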


In [23]:
Monthlystock.to_csv('data/MonthlyAAPLStock.csv')  # save the monthly data as CSV; the data folder must already exist

In [None]:
# Read the saved CSV file back into a DataFrame
Monthlystock_loaded = pd.read_csv('data/MonthlyAAPLStock.csv', index_col=0, parse_dates=True)
Monthlystock_loaded.head()  # the data is loaded, with the dates as the index

Date,Close,Open,High,Low,Volume
2020-01-31,75.26,75.08,75.84,77.54,2934370400
2020-02-29,75.24,75.01,76.21,78.2,3019279200
2020-03-31,63.47,63.14,65.23,70.89,6280072400
2020-04-30,65.88,65.74,66.7,69.74,3265299200
2020-05-31,75.12,74.86,75.92,76.76,2805936000
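
The save-and-reload round trip can be checked programmatically instead of by eye. This is a minimal sketch under stated assumptions: a temporary folder stands in for the notebook's `data` folder, and the frame `monthly`, the file name `monthly_demo.csv`, and the values are made up. It also creates the folder first, since `to_csv` does not create missing directories.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Made-up monthly values standing in for the notebook's data
monthly = pd.DataFrame(
    {"Close": [75.26, 75.24], "Volume": [2934370400, 3019279200]},
    index=pd.to_datetime(["2020-01-31", "2020-02-29"]).rename("Date"),
)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data"
    out_dir.mkdir(parents=True, exist_ok=True)  # make sure the folder exists
    csv_path = out_dir / "monthly_demo.csv"     # hypothetical file name

    monthly.to_csv(csv_path)
    # index_col=0 restores the first column as the index;
    # parse_dates=True converts it back to datetimes
    loaded = pd.read_csv(csv_path, index_col=0, parse_dates=True)

# Compare the reloaded frame with the original
same_dates = list(loaded.index) == list(monthly.index)
same_values = np.allclose(loaded.to_numpy(), monthly.to_numpy())
print(same_dates, same_values)
```

Without `parse_dates=True` the index would come back as plain strings, so the date comparison would fail.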



**Use the `df.to_excel` function to save a file**

In [25]:
Monthlystock.to_excel('data/MonthlyAAPLStock.xlsx') #saving the monthly data to an Excel file in the data folder

In [None]:
# Read the saved xlsx file back into a DataFrame
Monthlystock_excel = pd.read_excel('data/MonthlyAAPLStock.xlsx', index_col=0, parse_dates=True)
Monthlystock_excel.head()  # the data is loaded, with the dates as the index

Date,Close,Open,High,Low,Volume
2020-01-31,75.26,75.08,75.84,77.54,2934370400
2020-02-29,75.24,75.01,76.21,78.2,3019279200
2020-03-31,63.47,63.14,65.23,70.89,6280072400
2020-04-30,65.88,65.74,66.7,69.74,3265299200
2020-05-31,75.12,74.86,75.92,76.76,2805936000


**Use the `df.to_parquet` function to save a file**

In [27]:
# save the data to a parquet file
#Monthlystock.columns = Monthlystock.columns.rename('Aggregation', level=1) #needed to remove None level name error
Monthlystock.to_parquet('data/MonthlyAAPLStock.parquet', engine='fastparquet') #saving the monthly data to a parquet file in the data folder


In [28]:
Monthlystock_parquet = pd.read_parquet('data/MonthlyAAPLStock.parquet', engine='fastparquet')
Monthlystock_parquet.head() # it is loaded but in a different format

Date,Close,Open,High,Low,Volume
2020-01-31,75.26,75.08,75.84,77.54,2934370400
2020-02-29,75.24,75.01,76.21,78.2,3019279200
2020-03-31,63.47,63.14,65.23,70.89,6280072400
2020-04-30,65.88,65.74,66.7,69.74,3265299200
2020-05-31,75.12,74.86,75.92,76.76,2805936000
