# Data and Twitter Analysis
---

This notebook introduces students to common data formats and how Python can read and write them. How you choose to structure your data ultimately shapes how it can be stored and analyzed. After a discussion of data formats, we will explore a Twitter API response as an example of JSON, and tweets from the March 4th, 2017 March4Trump event in Berkeley will serve as a basis for analysis. At the end, students will collect their own data to explore.

*Estimated Time: 180 minutes*

---

**Topics Covered:**
- .xls and .csv formats
- .html and .xml
- .json
- Twitter API
- Twitter analysis and visualization

**Parts:**
- [Data Formats and Storage](#dataformats)

---
<a id=dataformats></a>
## Data Formats and Storage

Most people are familiar with Microsoft Excel's `.xls` spreadsheet format, which is great for storing tabular data. However, an `.xls` file also encodes a lot of information about how to display the data in Excel, along with any formulas you may have used, among other things. This extra information is often unnecessary for simply storing the raw data, and other software cannot easily read it. A "bare-bones" alternative to `.xls` is `.csv`, short for "comma-separated values". You may have encountered this format before, and it is no more complicated than its name: commas separate the values in each row to delimit columns, and each line represents a row.

The table:

| Name    | Age | Department | Hometown |
|---------|-----|------------|----------|
| Chris   | 27  | German     | Plymouth |
| Jarrett | 25  | Physics    | Newark   |
| Sofia   | 22  | Chemistry  | Boston   |
| Esther  | 24  | Economics  | Oakland  |


would be represented as:

~~~
Name,Age,Department,Hometown
Chris,27,German,Plymouth
Jarrett,25,Physics,Newark
Sofia,22,Chemistry,Boston
Esther,24,Economics,Oakland
~~~

Notably, nothing distinguishes the header except that it is the first row. There is also no way to add metadata or notes unless they fit into a column or row. Nevertheless, `.csv` is the standard for simple data and is easily read by most software. If you are collaborating with other researchers or using several pieces of software, you'll want to use this format.
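To make the point about the header concrete, here is a minimal sketch that parses the same table from an in-memory string. Using `io.StringIO` in place of a real file is an assumption made only to keep the example self-contained, and the variable names are illustrative. The `csv` module hands back every line as a plain list of strings; treating the first one as the header is a convention we apply ourselves.

```python
import csv
import io

# The table from above, held in a string instead of a file (illustrative only).
csv_text = """Name,Age,Department,Hometown
Chris,27,German,Plymouth
Jarrett,25,Physics,Newark
Sofia,22,Chemistry,Boston
Esther,24,Economics,Oakland
"""

# csv.reader accepts any iterable of lines, including a StringIO object.
rows = list(csv.reader(io.StringIO(csv_text)))

# Nothing in the data marks the header; we simply take the first row.
header, records = rows[0], rows[1:]

print(header)
print(records[0])
```

Note that every value, including the ages, comes back as a string, so numeric columns need an explicit conversion such as `int()` before analysis.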

Python can easily dump data into a `.csv`. The most straightforward approach is to write rows from a list of lists, where each sublist is one row of your data.

In [None]:
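import csv

# Each sublist is one row; the first row serves as the header.
my_data = [
    ['Name', 'Age', 'Department', 'Hometown'],
    ['Chris', '27', 'German', 'Plymouth'],
    ['Jarrett', '25', 'Physics', 'Newark'],
    ['Sofia', '22', 'Chemistry', 'Boston'],
    ['Esther', '24', 'Economics', 'Oakland'],
]

# newline="" lets the csv module handle line endings itself,
# which avoids blank lines between rows on Windows.
with open("my_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(my_data)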

Reading a `.csv` is just as easy:

In [None]:
# Read every row back; each row is a list of strings.
with open("my_data.csv", "r", newline="") as f:
    csv_data = list(csv.reader(f))

print(csv_data)
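
If you would rather look values up by column name than by position, the standard library's `csv.DictReader` uses the first row as keys. The sketch below is one possible approach, not the only one; it rewrites `my_data.csv` with the same contents first so that the cell runs on its own, and the names `people` and `ages` are illustrative.

```python
import csv

# Recreate the file with the same contents so this cell is self-contained.
my_data = [
    ['Name', 'Age', 'Department', 'Hometown'],
    ['Chris', '27', 'German', 'Plymouth'],
    ['Jarrett', '25', 'Physics', 'Newark'],
    ['Sofia', '22', 'Chemistry', 'Boston'],
    ['Esther', '24', 'Economics', 'Oakland'],
]
with open("my_data.csv", "w", newline="") as f:
    csv.writer(f).writerows(my_data)

# DictReader treats the first row as field names and
# returns each remaining row as a dictionary.
with open("my_data.csv", "r", newline="") as f:
    people = list(csv.DictReader(f))

# Values are still strings, so convert ages explicitly.
ages = [int(person["Age"]) for person in people]

print(people[0]["Name"], people[0]["Department"])
print(ages)
```

Accessing `person["Age"]` keeps working even if the column order in the file changes, which is harder to guarantee with positional indexing.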

If you still prefer Excel for analysis, you can go ahead and open this file in Excel!