# This week

* Reading and writing files
* Pandas

# File Input and Output ("File IO")

Up to this point you have been using APIs and toy datasets that could be easily copied and pasted into your Python script or Jupyter Notebook.  As you delve into "real" projects you will need to read data stored on the hard drive in myriad data formats (CSV, SHP, DBF, etc.).

Basic Python provides fine-grained control over file reading and writing. That much control comes at a price: the process can be quite tedious.  It is important to learn the "manual" way to read and write files for the cases when you get data in some strange format or you need to store it in memory (i.e., RAM) in a non-typical data structure.

Working with files does not feel very "pythonic" because so much must be done manually. However, many Python packages recognize this challenge and provide convenience functions for reading and writing files.  These functions abstract the hairy details.  You will see how NumPy and Pandas deal with file IO in this Notebook; they both provide convenience functions that automate many of the tedious steps.

#### Basic steps to follow when working with files (both reading and writing).
 1. Use the `open()` function to create a file object
 2. Do some stuff with the file object
   * read the content of the file
   * write stuff to the file
 3. Use the `close()` method of the file object when you're finished
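
The three steps above can be sketched in a few lines. This is only an illustration: the file name `steps_demo.txt` is made up, and the file is placed in a temporary folder (using the standard `tempfile` module) so nothing is left behind in your working directory.

```python
import os
import tempfile

# hypothetical file in a throwaway folder (an assumption for this sketch)
demo_path = os.path.join(tempfile.mkdtemp(), 'steps_demo.txt')

f = open(demo_path, 'w')      # step 1: create the file object
f.write('hello file\n')       # step 2: do some stuff with it (write)
f.close()                     # step 3: close it

f = open(demo_path, 'r')      # the same three steps, this time for reading
text = f.read()
f.close()
print(repr(text))
```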

### Writing Files - Basic Python

In the first step, we create a file object.

**Action**: Before running the following cell, inspect the directory where this Notebook is located. Is there a file called `newfile.txt` there?

In [1]:
outfile = open('newfile.txt', 'w')

**Action**: Go back to that directory. Is `newfile.txt` there now?

**Note**: The `'w'` in the cell above stands for "write". That cell is saying, "create a file called `newfile.txt` and prepare that file to be written to." If `newfile.txt` already existed, then it would be overwritten.

Since (nearly) everything in python is an... say it with me... "object", we can check its `type`.

In [2]:
type(outfile)

_io.TextIOWrapper

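We can also `dir` the object. The exact list depends on your Python version; the output shown here was captured with Python 2, so names such as `next`, `softspace` and `xreadlines` will not appear under Python 3.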

In [3]:
print(dir(outfile))

['__class__', '__delattr__', '__doc__', '__enter__', '__exit__', '__format__', '__getattribute__', '__hash__', '__init__', '__iter__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'close', 'closed', 'encoding', 'errors', 'fileno', 'flush', 'isatty', 'mode', 'name', 'newlines', 'next', 'read', 'readinto', 'readline', 'readlines', 'seek', 'softspace', 'tell', 'truncate', 'write', 'writelines', 'xreadlines']


The `write` method of a file object writes content to a file.

In [4]:
outfile.write('a few words\n')

In [5]:
outfile.write('some more words')

**Note**:
 * You can only "write" strings to a file.
 * The `write` method returns the number of characters written to the file
 * If you want a line break in the file, you need to explicitly include it by using `\n`.
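
The notes above can be checked with a small experiment. This is a sketch with a made-up file name (`write_demo.txt`) in a temporary folder; it assumes the file is opened in the default text mode, as in the cells above.

```python
import os
import tempfile

# hypothetical file name, used only for this sketch
demo_path = os.path.join(tempfile.mkdtemp(), 'write_demo.txt')
f = open(demo_path, 'w')

n_chars = f.write('a few words\n')   # return value: number of characters written
print(n_chars)

try:
    f.write(42)                      # 42 is not a string, so this fails
    failed = False
except TypeError as err:
    failed = True
    print('TypeError:', err)

f.write(str(42) + '\n')              # convert to a string first, add the line break
f.close()
```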

The third step is to close the file object.

In [6]:
outfile.close()

**Action**: Go back to the directory and open `newfile.txt` in Notepad++ or BBEdit. Does it have the content you wrote to it?

**Action**: Close `newfile.txt`. Repeat all the steps above, but after each step open the file to see what it contains, and then close it before running the next cell. 

The following cell creates a slightly more complex data set that we can write to a file.

In [7]:
import numpy as np
np.random.seed(789)   # ensure we get the same "random" data each time
data = np.random.randint(1,100,(5,10))
data

array([[52, 15, 31, 83,  2, 77, 91, 21, 28, 53],
       [82, 73, 17, 98, 63, 64, 66, 25, 58, 80],
       [ 5, 34, 69, 82, 68,  2, 53, 26, 39, 94],
       [ 6, 52, 32, 54, 69, 16, 90, 56, 43, 60],
       [60, 48, 28, 23, 54, 41, 38, 21,  9, 34]])

We will open a file with the *write* flag, just like last time.

In [8]:
outfile = open('newfile_nums.csv','w')

The next cell fills up the file. Basic python is kinda dumb when it comes to writing files, so we need to do everything by hand. We will use `for` loops to drill into each element of the array to save them one-by-one. Recall that we can only write strings, so the numbers in the array must be converted to strings before writing. In this case we'll make a string for each row in the array, and then write that string to the file.

In [9]:
for row in data:
    out_str = ''
    for value in row:
        out_str += str(value) + ','
    out_str = out_str[:-1] + '\n'
    outfile.write(out_str)
outfile.close()
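
As an aside, the row string built in the loop above can also be assembled with the `join` method of a string, which avoids having to trim the trailing comma. The sketch below uses a small made-up array (`demo`) instead of `data` and only builds the strings; writing them to a file works exactly as above.

```python
import numpy as np

# a small made-up array standing in for `data`
demo = np.array([[1, 2, 3],
                 [40, 50, 60]])

lines = []
for row in demo:
    # join puts a comma between the values, so there is no trailing comma
    lines.append(','.join(str(value) for value in row) + '\n')

print(lines)
```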

**Action**: Open `newfile_nums.csv` in Notepad++ or BBEdit. Notice how the code above matches what is in the file. Also note that the `\n` is not visible, but acts as an end of line character. Close the file and then open it in Excel. (Raise your hand if you opened it in Excel first instead of the text editor. Go back now and open it in a text editor so that the previous sentences make sense.) Notice that Excel processes the file by placing the contents into cells.

It is important to understand what end of line characters are doing. [This article](http://blog.codinghorror.com/the-great-newline-schism/) gives some history and context. Although different operating systems use different end of line characters, in python you can simply use `\n` and python will automatically translate it to the appropriate version for your operating system.
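
If you are curious what actually lands on the hard drive, you can write in text mode and then read the raw bytes back in binary mode (`'rb'`). This is a sketch with a made-up file name; what `raw` looks like depends on your operating system, which is the whole point.

```python
import os
import tempfile

# hypothetical file name, used only for this sketch
demo_path = os.path.join(tempfile.mkdtemp(), 'newline_demo.txt')

f = open(demo_path, 'w')       # text mode: '\n' is translated on the way out
f.write('line one\nline two\n')
f.close()

f = open(demo_path, 'rb')      # binary mode: the raw bytes, no translation
raw = f.read()
f.close()

f = open(demo_path, 'r')       # text mode again: translated back to '\n'
text = f.read()
f.close()

print(repr(os.linesep))        # end of line sequence for this operating system
print(raw)
print(repr(text))
```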

**Note**: You are writing human-readable text (ASCII) files. We use file extensions like ".csv" or ".txt" because they are conventions outside of Python.  If you double-click a CSV file in the file manager on your computer it will generally automatically open the file in Excel.  However, you could give it the extension ".xyz" and Python would not do anything different and it would still be readable by Excel. 

**Action**: Open a PDF file in a text editor (e.g., Notepad++ or BBEdit). You'll just see a bunch of junk. These are binary files that can only be read by specialized software. If you want to store data for a long time or make it easier for different software to read, then the more flexible ASCII format is preferred. [This article](https://www.nayuki.io/page/what-are-binary-and-text-files) is optional; it takes a deeper dive into this topic. Be sure you're comfortable with the difference between a human readable (i.e., "text file" or "ASCII file") and a "binary file."
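
To make the text versus binary distinction a bit more concrete, the sketch below stores the same number both ways. It uses the standard `struct` module, which is not covered in this notebook; the 4-byte little-endian layout (`'<i'`) is just one arbitrary choice for the illustration.

```python
import struct

number = 1234567

as_text = str(number).encode('ascii')    # the characters '1', '2', '3', ...
as_binary = struct.pack('<i', number)    # a 4-byte little-endian integer

print(as_text, len(as_text))             # readable in any text editor
print(as_binary, len(as_binary))         # looks like junk in a text editor

# the binary form only makes sense if the reader knows the layout
recovered = struct.unpack('<i', as_binary)[0]
print(recovered)
```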

### Reading Files - Basic Python

The steps for reading files in python are the same as for writing them. First we `open` the file, but this time we use the `r` flag for "read."

In [None]:
infile = open('newfile.txt', 'r')

In [None]:
type(infile)

**Note**: The file type of `infile` is `_io.TextIOWrapper`, just like `outfile` was above.

In [None]:
contents = infile.read()
contents

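Reading moves the position in the file to the end, so you only get one shot at pulling content from an open file object; if you want to read it again, the file needs to be opened again (or rewound with the file object's `seek` method). In the next cell `contents2` is an empty string since we already pulled the data from `infile`.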

In [None]:
contents2 = infile.read()
contents2

**Note**: The contents of `newfile.txt` have not changed on the hard drive as a result of what we've done. We are just reading it.
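
The "one shot" behavior comes from the file object keeping track of a position in the file. A minimal sketch, using a made-up file in a temporary folder, shows the empty second read and how the `seek` method (it is in the `dir` listing above) rewinds to the start:

```python
import os
import tempfile

# hypothetical file name, used only for this sketch
demo_path = os.path.join(tempfile.mkdtemp(), 'seek_demo.txt')
f = open(demo_path, 'w')
f.write('a few words\nsome more words')
f.close()

f = open(demo_path, 'r')
first = f.read()     # reads everything; the position is now at the end
second = f.read()    # nothing left to read, so this is an empty string
f.seek(0)            # move the position back to the start of the file
third = f.read()     # the full content again
f.close()

print(repr(second))
print(first == third)
```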

Again, the third step is to close the file.

In [None]:
infile.close()

The `read()` method introduced above is generally not too useful since it generates just one long string. The next cell introduces `readlines`.

In [None]:
infile = open('newfile.txt', 'r')

In [None]:
contents = infile.readlines()
contents

In [None]:
infile.close()

**Note**: Whoa!! What just happened?? What type of object is `contents` now? It's a list.  Each element in the list is one line from the text file. Notice that the `\n` is still hanging around (we'll fix that later).
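
A file object can also be looped over directly with `for`, which hands you one line at a time. The sketch below (made-up file name, temporary folder) suggests that this gives the same lines as `readlines`, `\n` included.

```python
import os
import tempfile

# hypothetical file name, used only for this sketch
demo_path = os.path.join(tempfile.mkdtemp(), 'lines_demo.txt')
f = open(demo_path, 'w')
f.write('a few words\nsome more words')
f.close()

f = open(demo_path, 'r')
from_readlines = f.readlines()   # a list with one string per line
f.close()

f = open(demo_path, 'r')
from_loop = []
for line in f:                   # the file object yields one line at a time
    from_loop.append(line)
f.close()

print(from_readlines)
print(from_loop == from_readlines)
```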

Let's return to our fancy CSV file. The next cell opens the file, reads in the content using `readlines` and then closes the file.

In [None]:
infile = open('newfile_nums.csv')
data = infile.readlines()
infile.close()

In [None]:
data

**Action**: Check out `data`. What type of object is `data`?  What will `data[0][6]` return? What type of object is `data[0][6]`? **STOP!** The following cells answer these questions, but test yourself first... see if you understand what you're looking at.

In [None]:
type(data)

In [None]:
data[0][6]

In [None]:
type(data[0][6])

Let's get rid of that pesky `\n`. We do this using the `strip` method of a string object. Recall that `data` is a list not a string. We therefore need to loop through each element of `data` and process the strings one-by-one. We will use a "list comprehension", which we learned a few weeks ago.

In [None]:
data = [i.strip() for i in data]  # remove the '\n' from each element
data

Let's check our old friend `data[0][6]`. What will it return now?

In [None]:
data[0][6]

That was pretty useful, but each element of `data` is still a long string.  We can split the string apart using the `split` method. Notice that we need to tell `split` that we want it to split each time it sees a comma.

In [None]:
data = [i.split(',') for i in data]
data

How is `data[0][6]` doing? What do you think?

In [None]:
data[0][6]

That is looking pretty good, but we now have a list of lists filled with strings, not numbers. In the following cell, we go through the list, grab each string and convert it to an integer.

In [None]:
for row_id in range(len(data)):
    for col_id in range(len(data[row_id])):
        data[row_id][col_id] = int(data[row_id][col_id])
data        

One last time, what do you expect from `data[0][6]`?

In [None]:
data[0][6]

**Note**: I will be the first to admit that this is not a lot of fun! Most advanced python packages recognize that what I have shown above is very tedious and prone to errors; as a result they provide custom readers and writers. The reason I show you this is that sometimes you need this fine grained control over the reading and writing process. Data is not always delivered to you in a clean way... recall the JSON data messes we have seen from APIs! Sometimes you need to read and write data in this brute force style.
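
For reference, the strip, split and convert steps above can be chained into a single nested list comprehension. This is just a compact restatement of the same logic; `raw_lines` is a made-up stand-in for what `readlines` returns.

```python
# made-up stand-in for the list returned by readlines()
raw_lines = ['52,15,31\n', '82,73,17\n']

# for each line: remove the '\n', split on commas, convert each piece to int
parsed = [[int(value) for value in line.strip().split(',')]
          for line in raw_lines]

print(parsed)
print(type(parsed[0][2]))
```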

### File IO with NumPy

The following cells show numpy functions for reading and writing text files. Notice that this parallels what you saw above. In both cases you need to explicitly say that the content is separated by commas. You also need to explicitly say that you want the results stored as integers. Even with all the flags, these two one-liners are certainly simpler to use than all the steps shown above.

In [None]:
data = np.loadtxt('newfile_nums.csv', delimiter=',', dtype=int)
data

In [None]:
np.savetxt('newfile_numpy.csv', data, delimiter=',', fmt='%i')

**Action**: Go to your file manager and verify that `newfile_numpy.csv` is there and contains what you think it should.
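
To see why the `dtype=int` flag matters, the sketch below writes a tiny made-up array with `np.savetxt` and reads it back twice, once with and once without the flag. The file name `numpy_demo.csv` and the temporary folder are assumptions for this illustration.

```python
import os
import tempfile
import numpy as np

# hypothetical file name, used only for this sketch
demo_path = os.path.join(tempfile.mkdtemp(), 'numpy_demo.csv')
original = np.array([[1, 2, 3],
                     [4, 5, 6]])

np.savetxt(demo_path, original, delimiter=',', fmt='%i')

floats = np.loadtxt(demo_path, delimiter=',')            # default: floats
ints = np.loadtxt(demo_path, delimiter=',', dtype=int)   # ask for integers

print(floats.dtype, ints.dtype)
print(ints)
```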

# Pandas

The pandas package might be the biggest game changer in python (for data analysis) in recent years. For those of you who are familiar with other programming languages (including R), I hope you have recognized the mixture of simplicity and power in the basic python syntax you've learned so far. Pandas brings simplicity and power to data manipulation. That is not to say that it's *easy* to learn, but there are big payoffs from mastering this package. In a couple weeks we will see a package called geopandas, which brings spatial functionality to what you'll learn this week.

#### Pandas:
* Name comes from [**PAN**el **DA**ta](https://en.wikipedia.org/wiki/Panel_data), although the package is not constrained to this use
* Provides Python a **data frame** (similar to what you've seen in R)
* Structured manipulation tools
* Built on top of NumPy
* Very **efficient** (i.e., it's fast)
* Rapidly evolving package; the team regularly releases new code
* Project is led by Wes McKinney

#### Resources  
- [pandas.pydata.org](http://pandas.pydata.org/)
- [Python for Data Analysis](http://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1449319793) by Wes McKinney
- [Cheat sheet](https://s3.amazonaws.com/quandl-static-content/Documents/Quandl+-+Pandas,+SciPy,+NumPy+Cheat+Sheet.pdf) by Quandl

#### Data Analysis

The process of conducting an analysis almost always requires working with data. These are some of the steps in an analysis.

* Get the raw data    
 * Conduct your own experiments, scraping the web, download data using an API, etc.
 * Data could come in different formats; might be unstructured

* Processing    
 * Getting ready
 * Cleaning, reshaping, joining, grouping

* Exploratory data analysis
 * Need to understand what you have
 * Plotting, summarizing, familiarizing

* Analysis
 * Statistics, machine learning, etc.

* Visualization   
 * Share results
 * Make tables, graphs, maps, etc.

* But...
> 80% of data analysis is spent on the process of cleaning and preparing the data -[Hadley Wickham](http://vita.had.co.nz/papers/tidy-data.pdf)

### A Simple Example

Based on Data Wrangling Kung Fu with Pandas by Wes McKinney

In [None]:
import pandas as pd
import numpy as np
print(pd.__version__)

**Note**: Pandas is in active development so sometimes old examples you find online no longer work. Typically that functionality has been superseded by something "better." You can scroll through this [history of the project](https://pandas.pydata.org/pandas-docs/stable/release.html) to see how often new releases come out.

__Note__: Most python packages follow the convention of putting the version number in the attribute `__version__`. This violates an earlier rule of thumb I gave you: "ignore attributes with leading underscores." Don't know what to tell you.

The following cell simply creates some data for us to use in pandas. You're not technically responsible for the code in this cell, but you should be familiar with most of what is happening here. The *new* stuff is the `with` statement and the `format` method of a string... both are useful.
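
Since `with` and `format` may be new to you, here is a minimal sketch of each in isolation. The file name and values are made up. The key points: `with` closes the file for you when the indented block ends, and `format` drops its arguments into the numbered `{}` placeholders.

```python
import os
import tempfile

# hypothetical file name, used only for this sketch
demo_path = os.path.join(tempfile.mkdtemp(), 'with_demo.csv')

with open(demo_path, 'w') as f:
    # {0}, {1} and {2} are replaced by the three arguments, in order
    line = '{0},{1},{2}\n'.format('2015-09-16', 'dogs', 3)
    f.write(line)

print(f.closed)       # the with block already closed the file
print(repr(line))
```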

In [None]:
dates = ['2015-09-16', '2015-09-17', '2015-09-18', '2015-09-19']
species = ['dogs','cats','birds']

filename = 'pandas_example.csv'

np.random.seed(123456)
with open(filename,'w') as outfile:
    outfile.write('day,species,animals\n')
    for d in dates:
        for s in species:
            a = np.random.randint(1,10, size=1)[0]
            tmp = '{0},{1},{2}\n'.format(d,s,a)
            outfile.write(tmp)

In [None]:
with open(filename, 'r') as infile:
    print(infile.read())

**Action**: Open `pandas_example.csv` in a text editor and Excel to get a feeling for the data you'll be working with.

#### Create a data frame

When we introduced numpy, we spent a lot of time on the array. In this introduction to Pandas, you'll see a lot of material on the *data frame*.

The `pd.read_csv` function reads a CSV from the hard drive, and returns a pandas data frame. The data we have are the number of new animals entering the animal shelter each day.

In [None]:
df = pd.read_csv(filename)
print(df)

#### Reshape with `pivot`

Although the data in the above cell can be understood, it can be *reshaped* a little to make it clearer.

In [None]:
data = df.pivot(index='day', columns='species', values='animals')
print(data)

In [None]:
data.columns   # column headers

In [None]:
data.index     # row headers

In [None]:
print(type(data.columns))
print(type(data.index))

**Note**: In pandas, **`index`** refers to the row labels and **`columns`** to the column labels. But notice that they have the same type.
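
If the reshaping is hard to follow on the shelter data, a smaller made-up example may help. In this sketch, `long_df` has one row per (day, species) pair; `pivot` turns the unique `day` values into the row labels and the unique `species` values into the column labels.

```python
import pandas as pd

# made-up "long" data: one row per (day, species) pair
long_df = pd.DataFrame({'day': ['d1', 'd1', 'd2', 'd2'],
                        'species': ['cats', 'dogs', 'cats', 'dogs'],
                        'animals': [3, 5, 2, 7]})

wide = long_df.pivot(index='day', columns='species', values='animals')

print(wide)
print(list(wide.index))     # row labels come from the 'day' column
print(list(wide.columns))   # column labels come from the 'species' column
```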

#### Column access

The two most straightforward ways to extract a __column__ from the data frame are seen in the next two cells.

In [None]:
data['dogs']

In [None]:
data.dogs

**Note**: Even after we extract the column, it still knows that its name is `dogs`, and it also knows the name of each row.

#### Row access

Row access is not as clean as column access, but it's still not too difficult. 

- `iloc`: allows you to select using the position
- `loc`: allows you to select using the name

In [None]:
print(data)

In [None]:
data.loc['2015-09-17']

In [None]:
data.iloc[1]

#### General access

The `iloc` and `loc` methods are powerful. For two dimensional data frames, they follow the pattern: `[row_id, column_id]`

In [None]:
print(data)

Get the number of dogs on 2015-09-17

In [None]:
data.loc['2015-09-17', 'dogs']

In [None]:
data.iloc[1,2]

__Aside__: "Deprecation" is a term used by software developers to let their users know that certain functionality will be discontinued in the future. A "deprecation warning" is typically raised when the user uses the deprecated functionality.  To be clear, deprecated functionality still works "today", but it might not work the next time you update the package.

An example of this is the `ix` method in pandas. `ix` is a single method that does the work of `loc` and `iloc`. The developers realized that `ix` was causing problems for users (think about the kind of problems that might arise if your row indexes or column names [were integers](http://s2.quickmeme.com/img/75/75258f71ed5ec11e6f2e1b20cdecdc40ff7493f04659b7704aac1040442e7a7a.jpg), `ix` might get confused), and that the more specific `loc` and `iloc` methods were safer options. `ix` was a key part of pandas for years, so you might see it in examples you find online. It was deprecated in pandas 0.20 and removed entirely in pandas 1.0, so "new code" should not use `ix`.

On a pandas version older than 1.0, the following cell runs, but it also gives a big (pink) warning message. Read the "warning" message, hopefully it makes sense now. On pandas 1.0 or newer, the cell raises an `AttributeError` instead because `ix` no longer exists.

In [None]:
data.ix['2015-09-17', 'dogs']
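
The integer problem mentioned above is easy to reproduce. In this sketch the row labels of a made-up data frame are the integers 2, 0, 1 (deliberately out of order), so "row 0" means different things to `loc` and `iloc`:

```python
import pandas as pd

# made-up data frame whose row labels are integers, out of order
labeled = pd.DataFrame({'count': [10, 20, 30]}, index=[2, 0, 1])

by_label = labeled.loc[0, 'count']    # the row whose *label* is 0
by_position = labeled.iloc[0, 0]      # the *first* row, whatever its label

print(by_label, by_position)
```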

#### Access using the colon ( : )

The colon ( : ) means, "give me everything along that dimension"

In [None]:
print(data)

In [None]:
data.loc['2015-09-17', :]

In [None]:
data.iloc[1,:]

In [None]:
data.loc[:, 'dogs']

In [None]:
data.iloc[:,2]

__Note__: If there is only one term inside the square brackets, Pandas assumes you are indexing the row. Therefore,
- `data.loc['2015-09-17']` is equivalent to `data.loc['2015-09-17',:]`
- `data.iloc[1,:]` is equivalent to `data.iloc[1]`

#### Range access

We can use our old range selection techniques that we originally learned for lists, and then modified for arrays.

In [None]:
print(data)

In [None]:
print(data.iloc[2:4,1:])     # slicing by [rows,columns]
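
One wrinkle worth knowing: slicing with `iloc` excludes the stop position (just like lists and arrays), but slicing by label with `loc` includes the stop label. A small sketch with made-up data:

```python
import pandas as pd

# made-up data frame with four labeled rows
small = pd.DataFrame({'birds': [1, 2, 3, 4],
                      'cats': [5, 6, 7, 8]},
                     index=['d1', 'd2', 'd3', 'd4'])

by_position = small.iloc[1:3]        # positions 1 and 2; the stop (3) is excluded
by_label = small.loc['d2':'d3']      # labels d2 through d3; the stop is included

print(by_position)
print(by_label)
```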

#### Multiple access

You can use lists to select multiple rows and/or columns. Notice though that we are still using the `[row_id, column_id]` structure.

In [None]:
print(data.loc[['2015-09-16', '2015-09-19'], ['birds', 'dogs']])

In [None]:
print(data.iloc[[0,3], [0,2]])

__Action__: Try the following for yourself (Bonus: do each one (at least) two different ways)

In [None]:
print(data)

Get all the data for birds

Get all data for 2015-09-16

Get the data for cats on 2015-09-19

Get the data on cats and dogs on 2015-09-18

#### Summarize rows and columns

Let's explore the data a little. Go through these slowly to see what is happening.

What is the average number of animals (by species) admitted each day?

In [None]:
data.mean(axis=0)

What is the total number of animals each day?

In [None]:
data.sum(axis=1)

How many types of species came in each day?

In [None]:
data.count(axis=1)

#### Sum one row or column

In [None]:
data['dogs'].sum()

In [None]:
data.dogs.sum()

In [None]:
data.loc['2015-09-17'].sum()

**Note**: Notice the logic above. We first grab the column or row that we want, and THEN ask it for its `sum()`.

In [None]:
data.sum(axis=0).loc['dogs']

**Action**: The above cell also gives us the total number of dogs admitted to the shelter. Work through the syntax on your own; notice that we have just rearranged the pieces you've already seen. 
 * If you do not totally understand the above cell, take the time now to review the pandas content again from the beginning.
 * Note that you can copy the line above into a new cell and start removing (or changing) parts of it one-by-one to discover what is happening.
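
If the `axis` argument still feels slippery, it can help to try it on a data frame small enough to add up in your head. The numbers below are made up. One way to think about it: `axis=0` collapses the rows (one answer per column) and `axis=1` collapses the columns (one answer per row).

```python
import pandas as pd

# made-up data frame, small enough to check by hand
small = pd.DataFrame({'cats': [1, 2],
                      'dogs': [3, 4]},
                     index=['d1', 'd2'])

col_totals = small.sum(axis=0)   # one value per column: cats, dogs
row_totals = small.sum(axis=1)   # one value per row: d1, d2

print(col_totals)
print(row_totals)

# two routes to the same number: total dogs
print(small.sum(axis=0).loc['dogs'], small['dogs'].sum())
```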

#### Fancy data selection

Again, go slowly through these examples to understand what is happening.

In [None]:
print(data)

Show all the data for just the days when more than 2 dogs entered the shelter.

In [None]:
print(data.dogs > 2)         # identify the days
print('\n')
print(data[data.dogs>2])     # select the days

How many dogs entered the shelter on days when the number of cats was less than average?

In [None]:
print(data.cats.mean())
print('\n')
print(data.cats)
print('\n')
print(data.dogs)

Bring it all together.

In [None]:
data.dogs[data.cats < data.cats.mean()]

**Note**: You can consider the stuff inside the square brackets kind of like a mask that only reveals the `True` observations from the larger object. 
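
Here is the mask idea spelled out on a tiny made-up data frame, with the mask stored in its own variable so you can look at it before using it:

```python
import pandas as pd

# made-up data, small enough to check by hand
small = pd.DataFrame({'cats': [1, 6, 2],
                      'dogs': [4, 5, 9]},
                     index=['d1', 'd2', 'd3'])

mask = small.cats < small.cats.mean()   # the mean of cats is 3.0
selected = small.dogs[mask]             # keep dogs only where the mask is True

print(mask)
print(selected)
```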

#### Add some data with `pd.concat`

In [None]:
df = pd.read_csv(filename)   # reread the original data
print(df)

The following cell just creates some new data on reptiles.

In [None]:
tmp = {'day': ['2015-09-16','2015-09-18'],
       'species': ['reptiles', 'reptiles'],
       'animals': [11, 7]}
print(pd.DataFrame(tmp))

Notice what is going into the `pd.concat` function below. We pass in a list that contains two data frames; then pandas concatenates them. The `ignore_index` flag is needed since the two input data frames both have an index `0` and `1`.

In [None]:
df = pd.concat([df,pd.DataFrame(tmp)], ignore_index=True)
df.shape

In [None]:
print(df)
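
To see what `ignore_index` is doing, compare the row labels with and without it. The two data frames in this sketch are made up; without the flag, the labels `0` and `1` each show up twice in the result.

```python
import pandas as pd

# two made-up data frames that both have row labels 0 and 1
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

kept = pd.concat([a, b])                            # original labels are kept
renumbered = pd.concat([a, b], ignore_index=True)   # fresh labels 0..n-1

print(list(kept.index))
print(list(renumbered.index))
```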

We can remove rows and columns using the `drop` method. It is important to tell pandas along which `axis` to look for the label. In the example below we want to drop a _column_ so we use `axis=1`.

In [None]:
print(df.drop('day', axis=1))

The above cell does not make a permanent change to `df`.

In [None]:
print(df)

We can drop a row using `axis=0`.

In [None]:
print(df.drop(2, axis=0))

Most data frame methods return a new data frame and leave the original untouched, so the changes you make are almost never permanent unless you ask for that. We'll make a separate data frame to hack on for the moment.

In [None]:
np.random.seed(888)
changer = pd.DataFrame(np.random.randint(1,20, size=(6,5)), 
                       columns=['A','B','C','D','E'], 
                       index=['r1','r2','r3','r4','r5','r6'])
print(changer)

One way to make a change permanent is to reassign the result of the method back to the same variable.

In [None]:
changer = changer.drop('D', axis=1)
print(changer)

Some methods have an `inplace` argument. If you set this to True, then the change will be permanent.

In [None]:
changer.drop('r3', axis=0, inplace=True)
print(changer)

In [None]:
changer = changer.drop(['r2','r5'], axis=0)
print(changer)

__Note__: In the cell above, the first term is a list of row index names, which resulted in multiple rows being dropped. Go back up and recreate the `changer` data frame from scratch, and run some other drops on it.
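
The difference between the two approaches shows up in what the method hands back. In this sketch (made-up data frame), `drop` without `inplace` returns a new data frame, while `drop` with `inplace=True` changes the original and returns `None`.

```python
import pandas as pd

# made-up data frame for this sketch
small = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

result = small.drop('B', axis=1)                   # a new data frame without B
print(list(small.columns))                         # small itself still has A and B

returned = small.drop('A', axis=1, inplace=True)   # small itself is changed
print(returned)                                    # nothing comes back
print(list(small.columns))
```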

Back to the animals.

The `head` method returns the first few lines of the data frame.

In [None]:
print(df.head())

We'll drop row 2 from the data frame

In [None]:
df.drop(2, axis=0, inplace=True)

Let's reshape again with `pivot` to get back to something more useful.

In [None]:
data = df.pivot(index='day', columns='species', values='animals')
print(data)

**Note**: Notice that we now have `NaN` values in our data frame. To pandas, these are "missing values". Be clear that NaN is not "zero". Also note that `NaN` is treated like a float, so all the values in the data frame are now floats.

In [None]:
data.mean(axis=0)

In [None]:
data.sum(axis=1)

**Action**: Pandas *ignores* `NaN` values. Go to your calculator and compute the mean (i.e., average) for the `birds` column in two different ways: 
  1. treat the `NaN` as zero (you should get 5.0)
  2. treat the `NaN` as missing (you should get 6.667)

  Which value did pandas return? In real world analyses you will often be confronted with missing values since real data are messy. As the analyst, the way *you* decide how to treat missing values can have a big impact on the final results.
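
The same comparison on a made-up series of three values, small enough to check in your head. The `skipna` argument of `mean` is shown as one way to switch off the default behavior of ignoring `NaN`.

```python
import numpy as np
import pandas as pd

# made-up values with one missing entry
s = pd.Series([4.0, np.nan, 8.0])

as_missing = s.mean()              # NaN ignored: (4 + 8) / 2
as_zero = s.fillna(0).mean()       # NaN treated as zero: (4 + 0 + 8) / 3
no_skip = s.mean(skipna=False)     # NaN not ignored: the result is NaN

print(as_missing, as_zero, no_skip)
```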

New columns (and rows) can be added to a data frame.

In [None]:
data['total'] = data.sum(axis=1)
print(data)

How many types of species did we get each day?

In [None]:
data.count(axis=1)

How many days did we get each type of species?

In [None]:
data.count(axis=0)

#### Missing values: `isnull()` and `fillna()`

Pandas offers a bunch of tools for working with missing data.

In [None]:
print(data)

In [None]:
print(data.isnull())

**Note**: The `isnull` method returns a boolean data frame with `True` for all the elements that are `NaN`.

Sometimes you know that a missing value is really a zero (or some other value). Pandas offers a method to help with this too.

In [None]:
print(data.fillna(0))

In [None]:
print(data)

**Warning**: Notice in the cell above that the `fillna` method didn't actually change our data frame.

Data frames have a `copy` method that makes a copy that is no longer connected to the original data frame.

In [None]:
tmp = data.copy()

In [None]:
tmp.fillna(0, inplace=True)
print(tmp)

**Note**: Recall the `inplace` flag introduced a few cells above. When this is set to `True` the data frame itself is overwritten.

Sometimes we don't want all the overhead that comes with a pandas data frame: we just want to work with the simpler numpy array. In other cases, we might have someone else's function that takes a numpy array as input; again we need to convert. The `values` attribute does that.

In [None]:
x = tmp.values
x

**Note**: There is a bunch of subtlety in the above cell.
  1. `values` is an attribute not a method (we know this because there are no parentheses)
  2. the column and row names from the data frame are not included in the numpy array
  3. a numpy array can only be one type, so it will pick the type that can best accommodate the data
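
The third point is easiest to see by example. In this sketch (made-up data), a data frame mixing integers and floats becomes a float array, and one mixing integers and strings falls back to the catch-all `object` type. Newer pandas versions also offer the `to_numpy` method, which returns the same array here.

```python
import numpy as np
import pandas as pd

# made-up data frame with an integer column and a float column
numbers = pd.DataFrame({'n': [1, 2], 'x': [0.5, 1.5]}, index=['r1', 'r2'])

arr = numbers.values          # row and column labels are dropped
same = numbers.to_numpy()     # method form of the same conversion
print(arr, arr.dtype)

# made-up data frame mixing integers and strings
mixed = pd.DataFrame({'n': [1, 2], 'name': ['a', 'b']}).values
print(mixed.dtype)
```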

Pandas borrows (steals?) much of the good stuff from the R data frame. In the cells below we will reshape our data using `melt`.

In [None]:
print(df)

In [None]:
data2 = df.pivot(index='day', columns='species', values='animals')
print(data2)

In [None]:
data2.reset_index(inplace=True)
print(data2)

In [None]:
back = pd.melt(data2, id_vars=['day'])   # take us back to where we began
print(back)

**Action**: Identify everything that is different between the `back` data frame and the `df` data frame.

We can easily fix some of these differences... see the cells below.

In [None]:
back = back.dropna()
print(back)

In [None]:
back.rename(columns={'value':'animals'}, inplace=True)
print(back)
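
As an aside, `pd.melt` can name the new columns itself through its `var_name` and `value_name` arguments, which saves a separate `rename`. A sketch on a small made-up wide table:

```python
import pandas as pd

# made-up wide table: one row per day, one column per species
wide = pd.DataFrame({'day': ['d1', 'd2'],
                     'cats': [3, 2],
                     'dogs': [5, 7]})

long_df = pd.melt(wide, id_vars=['day'],
                  var_name='species',     # name for the old column headers
                  value_name='animals')   # name for the cell values

print(long_df)
```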

**Note**: I hope you are seeing that pandas is a pretty powerful package. It is very rare that there is not a function or method that already does the thing you want to do.

Okay, we're (finally) finished. Now we want to save our amazing work to the hard drive. Wouldn't you know it, pandas has a simple one-liner to do this too.

In [None]:
back.to_csv('back.csv', index=False)

**Action**: Go to your hard drive and see the fruits of your labor. You may want to email this file to all your friends.

# Test Yourself

1) Create a new data frame about bike wheels called `bikes` where:
- the index names are 'bmx', 'road', 'mountain'
- it will have one column named 'diameter'
- the values are 20, 29, 26

Hints:
- There is an example of this earlier in the notebook, or you can use the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).
- This is a data frame with 3 rows and 1 column, NOT a data frame with 1 row and 3 columns. If you get stuck, try making a numpy array with 3 rows and 2 columns. Then convert that to a numpy array with 3 rows and 1 column.

2) Compute the area of the wheels, and add a column to your data frame (called `area`) that contains the areas. Note that we're working with diameters now so the equation is:

$area = \pi * \left(\frac{diameter}{2}\right)^2$

Note: Look closely at your results to be sure your areas are correct.

3) Compute the mean diameter and area for all bike types.

4) Select the area of a road bike wheel.

5a) Write your data frame to a file.

5b) When writing the data frame to a file, what does the `index` parameter do? You can trial-and-error by saving the file with `index` set to `True` one time and then `False` another and opening it in Excel or a text editor; and/or you could [read the documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html).

6) Recall the week 1 slides? In those I showed some stats on Python on Stack Exchange. Page 32 indicated that pandas questions are by far the most common python related questions. So let's see how good you are with Googling to answer this question. 

The cells below create two data frames. They share a common column (`tree_id`). The goal is to create a single data frame with the columns `height`, `diameter`, `status`, `tree_id`. Column order does not matter, but it is important that the data from tree `890` from `df1` is matched with tree `890` from `df2`.

You could do some manual stuff to get this answer, but there is a pandas one-liner to do this. Since it is not shown in the examples in this notebook, you'll need to Google for it. Hint: formulate your question for Google, and then add the word `pandas` to your search query (you don't need to include `python` in your query). Although there is some danger that you might get a [result like this](https://www.popsci.com/sites/popsci.com/files/styles/655_1x_/public/images/2017/03/6990634-panda-hug.jpg?itok=hHdcx6TM&fc=50,50).

In [None]:
df1 = pd.DataFrame([[56,12,789], [87,34,890], [32,9,345],
                    [78,25,567], [21,7,123]], 
                   columns=['height','diameter','tree_id'])
df1

In [None]:
df2 = pd.DataFrame([['sick',345], ['healthy',789], ['sick',123],
                    ['healthy',890], ['healthy',567]], 
                   columns=['status','tree_id'])
df2

7) The following cell is trying to write a string to a file. Fix it.

In [None]:
my_file = open('test_yourself.txt')
my_file.write('geography rocks')
my_file.close()