This tutorial is meant to introduce the user to data analysis using Python, Jupyter Notebook, and Pandas.

The tutorial you're looking at right now is meant for users new to the Python and Jupyter Notebook ecosystem. Other tutorials, which are less curated but more detailed, include the following:

* [Python's "The Python Tutorial"](https://docs.python.org/3/tutorial/index.html), which describes *everything* about Python, but may be intimidating to tackle all at once.
* [The W3Schools Python Tutorial](https://www.w3schools.com/python/default.asp), which goes into less depth, but progresses more slowly and lets you try out small code examples as you learn.
* [Pandas' own "10 minutes to pandas" tutorial](https://pandas.pydata.org/docs/user_guide/10min.html).
* [A more detailed suite of Pandas tutorials](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html).

# Common Terminology

* **Module**: A module is a file containing Python code.
* **Package**: In the context of Python, a package is a collection of modules and sub-packages.
* **Python Interpreter**: A computer program that executes Python code.
* **Python Environment**: The context in which a Python program runs. The installed Python interpreter and any installed packages are available within a Python environment.
* **Virtual Environment**: A Python environment that is isolated from the host environment (i.e., you can run Python code that only has access to a specific set of packages and modules, to prevent conflicts between packages and modules).
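These terms can be explored from inside Python itself. The following is a minimal sketch using only the standard library; the `string` and `json` modules are chosen purely for illustration, and the prefix comparison assumes an environment created with `venv` or a recent version of virtualenv.

```python
import json    # a package from the standard library (a directory of modules)
import string  # a module from the standard library (a single file)
import sys

# Path to the Python interpreter that is executing this code.
interpreter_path = sys.executable

# Inside a virtual environment, sys.prefix points at the virtual environment,
# while sys.base_prefix points at the host Python installation.
in_virtual_env = sys.prefix != sys.base_prefix

# Packages carry a __path__ attribute listing where their modules live;
# plain modules do not.
json_is_package = hasattr(json, "__path__")
string_is_package = hasattr(string, "__path__")

print("Interpreter:", interpreter_path)
print("Running in a virtual environment:", in_virtual_env)
print("json is a package:", json_is_package)
print("string is a package:", string_is_package)
```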

# Frequently Used Tools

* pip is the most common package manager for Python. Installation instructions are at https://pip.pypa.io/en/stable/installation/
* Anaconda is a Python distribution geared toward data analysis that bundles the `conda` package manager. It is available at https://www.anaconda.com/
* Pandas is a library for data analysis and manipulation. It is documented at https://pandas.pydata.org/
* virtualenv is a tool for creating isolated Python environments. It is documented at https://virtualenv.pypa.io/en/latest/

First, an introduction to how Jupyter Notebooks work.

This tutorial is done in a Jupyter Notebook (also known as an IPython Notebook). In contrast to a typical Python script, a Jupyter notebook lets you execute code one cell at a time and view the results immediately. This is useful for more iterative programming, or for showing your work to others.

Start by executing the code cell below:

In [17]:
x = 1

With Jupyter notebooks, you can execute code in the cell. In some cases, as in the above, executing code successfully will produce no result outside of the announcement that it was executed. To confirm that it was executed successfully, we can run code that contains only the variable we just set above, which will display the result in an output cell just below the cell being executed.

In [5]:
x

1

Note that cells in Jupyter have a cumulative effect depending on the order in which they are executed. If you execute the same block of code multiple times, each run starts from the state left behind by the previous run. Try executing the block below multiple times to see the result.

Afterwards, if you go back to the `x = 1` block, re-execute that, and then execute this block again, you will see that `x` has returned to its original state.

In [23]:
x += 1
# As before, if the last line of our code contains only a variable, that variable will be displayed in the output.
x

5

Note that the order in which cells are executed does not always match the order they are arranged. Below is a repetition of the same `x = 1` code. If you execute this, and then rerun the above cell again, you will see it behaves just the same as the `x = 1` cell above. 

In [19]:
x = 1

If code produces an error, the output cell will display the error.

In [16]:
# A lone dollar sign is not valid Python. Executing this will produce a SyntaxError. (Syntax is the set of rules defining the symbols of a programming language and how they can be combined to produce valid code.)
$

SyntaxError: invalid syntax (2707316568.py, line 2)
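The same kind of error can also be triggered and inspected from code. This is a minimal sketch, not part of the original notebook: it uses the built-in `compile()` function to parse the text of a cell without running it, and the filename `<example cell>` is just a placeholder label.

```python
# Source text of a "cell" containing only the invalid dollar sign.
cell_source = "$"

try:
    # compile() parses the text without executing it, which is the stage
    # where syntax errors are detected.
    compile(cell_source, "<example cell>", "exec")
    error_name = None
except SyntaxError as err:
    # The exception object records what went wrong and on which line.
    error_name = type(err).__name__
    print(f"{error_name}: {err.msg} (line {err.lineno})")
```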

To start, we'll need to install and import the Pandas library.

There are several ways to install Pandas, depending on how you manage your packages. Examples include:

* `pip install pandas`
* `conda install pandas`
* Installing it through a package manager within an IDE such as PyCharm.
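If you are unsure whether Pandas is already installed in your current environment, you can check before importing it. The following sketch uses only the standard library and makes no assumption about which installation method you chose.

```python
import importlib.util

# find_spec() looks the package up in the current Python environment
# without importing it, and returns None when it cannot be found.
pandas_spec = importlib.util.find_spec("pandas")

if pandas_spec is None:
    print("pandas is not installed in this environment")
else:
    print("pandas is installed at:", pandas_spec.origin)
```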

Below is the first code cell that uses Pandas. As with any code cell, executing it will do one of the following, depending on its content:
* Produce an error.
* Run successfully without displaying anything.
* Run and display the results.

In [2]:
# Execute the import statement
import pandas as pd

For practice, we will be using a sample data set. We will use the `iris` dataset from the `seaborn` module (a library we will discuss later).

In [3]:
# Run the `read_csv` command (which can take a url as a parameter) and store the data in the variable `iris`
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
# Display the data
iris

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


As you can see, displaying the data produces a tabular view of the columns and rows. When a DataFrame is long, as it is here, the middle rows are elided (shown as `...`) rather than displayed in full. The data is stored as a `DataFrame`, an object specific to `pandas`.
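`read_csv` is not limited to URLs. As a minimal sketch, the example below reads CSV text held in memory, which is handy for experimenting without a network connection. The four rows are copied from the iris output shown in this tutorial; the variable names `csv_text` and `sample` are illustrative only.

```python
import io

import pandas as pd

# A few iris rows written out as CSV text.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
6.2,3.4,5.4,2.3,virginica
5.9,3.0,5.1,1.8,virginica
"""

# read_csv accepts a file-like object as well as a path or a URL.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample).__name__)  # the class of the returned object
print(sample.shape)           # (number of rows, number of columns)
```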

To view the first few rows of a DataFrame, we can use the `.head()` method. A method is a command that is called from an object (as opposed to a function, which is called without reference to an object). By calling `.head()` on the `iris` variable, we're indicating we want the results of the `.head()` method for the `iris` DataFrame. If we had another DataFrame and called `.head()` on that, we'd get the results for that DataFrame instead.

In [27]:
iris.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
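The difference between a function and a method can be sketched in plain Python, without Pandas. The lists below are made-up stand-ins; `len()` is a built-in function and `.append()` is a method of list objects.

```python
measurements = [5.1, 4.9, 4.7]

# A function is called on its own, with the object passed in as an argument.
count = len(measurements)

# A method is called from an object using dot notation and acts on that object.
measurements.append(4.6)

# Calling the same method on a different object affects only that object.
other_measurements = [6.7]
other_measurements.append(6.3)

print(count)               # length measured before the append
print(measurements)
print(other_measurements)
```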


Documentation on every Pandas method, including `.head()`, can be found at [the pandas documentation website](https://pandas.pydata.org/docs). Documentation for the head method can be found [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head).

If you look at the documentation, you may notice `n=5` in the method signature. This indicates that `.head()` takes a single parameter, `n`, which defaults to 5. We can set it to a different value, as shown here:

iris.head(n=3)

Note that in Python, arguments can be provided either positionally, in the order the parameters were declared in the function's definition (these are positional arguments, conventionally abbreviated `args`), or by naming the parameter, as done above (these are keyword arguments, conventionally abbreviated `kwargs`). Positional arguments are often simpler to write when a function takes only a few arguments, but they become confusing when a function or method takes many, and keyword arguments are preferred in such cases.
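To make the two calling styles concrete, here is a small sketch. The function `describe_flower` is hypothetical and defined only for this illustration; it is not part of Pandas.

```python
# Hypothetical helper, defined only for this illustration.
def describe_flower(species, sepal_length, sepal_width):
    return f"{species}: {sepal_length} x {sepal_width}"

# Positional: values are matched to parameters by their order.
by_position = describe_flower("setosa", 5.1, 3.5)

# Keyword: values are matched by name, so the order no longer matters.
by_keyword = describe_flower(sepal_width=3.5, species="setosa", sepal_length=5.1)

print(by_position)
print(by_position == by_keyword)
```

Both calls produce the same string. The keyword version is longer, but it stays readable even if you forget the order in which the parameters were declared.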

Note that there is a similar method for finding the last rows of a DataFrame, `.tail()`:

In [28]:
iris.tail(n=2)

,sepal_length,sepal_width,petal_length,petal_width,species
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


Note the number at the far left of the displayed DataFrame. This is known as the `index`. To see more details about the index of a DataFrame, we can use the [`.index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.index.html#pandas.DataFrame.index) attribute (an attribute, unlike a method, is accessed without parentheses):

In [30]:
iris.index

RangeIndex(start=0, stop=150, step=1)

This returns a RangeIndex object, an object expressing properties of the given index -- in this case, where it starts, where it stops, and by how much it increments every step (in this case, 1).
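A `RangeIndex` can also be built directly, which makes its three properties easier to explore. This is a small sketch that recreates the index shown above rather than loading the iris data; the variable name `index` is illustrative.

```python
import pandas as pd

# Recreate the index reported for the 150-row iris DataFrame.
index = pd.RangeIndex(start=0, stop=150, step=1)

# The stop value is exclusive, so the labels run from 0 through 149.
print(len(index))
print(index[0], index[-1])

# The three properties are available as attributes.
print(index.start, index.stop, index.step)
```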

Similarly, if we want to know about all the columns available, we can use the [`.columns`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html#pandas.DataFrame.columns) attribute.

In [31]:
iris.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

This returns an `Index` object, which is used to label rows and columns in a DataFrame. It shows a list of the column names, followed by the `dtype` (data type) of the elements in the Index. Here the dtype is `object`, which in this case means the elements are strings.
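An `Index` of column names behaves much like a sequence of strings. The sketch below uses a small stand-in DataFrame (named `sample`, with values taken from the first two iris rows) so that it runs without downloading anything.

```python
import pandas as pd

# Small stand-in DataFrame built from the first two iris rows.
sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9],
        "sepal_width": [3.5, 3.0],
        "species": ["setosa", "setosa"],
    }
)

# Convert the Index to a plain Python list of column names.
column_names = sample.columns.tolist()

# Check whether a column exists before trying to use it.
has_species = "species" in sample.columns

# Iterating over the Index yields each column name in order.
for name in sample.columns:
    print(name)
```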

We can limit the view to specific columns by following the DataFrame with square brackets containing the column we want. For example, to get only `sepal_length`, we can do the following:

In [33]:
iris["sepal_length"]

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64

If we want to get multiple columns, however, the syntax is a little different: we need to provide a comma-separated list of the columns, enclosed within double brackets. The outer brackets perform the selection, and the inner brackets form a Python list of column names:

In [34]:
iris[["sepal_length", "sepal_width"]]

,sepal_length,sepal_width
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1
4,5.0,3.6
...,...,...
145,6.7,3.0
146,6.3,2.5
147,6.5,3.0
148,6.2,3.4
149,5.9,3.0
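The two bracket styles return different kinds of objects, which explains why their output looks different. The sketch below uses a small stand-in DataFrame named `sample` (an assumption made so the example runs on its own) to show this.

```python
import pandas as pd

# Small stand-in DataFrame built from the first two iris rows.
sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9],
        "sepal_width": [3.5, 3.0],
        "species": ["setosa", "setosa"],
    }
)

# Single brackets with one column name return a Series (a single column).
one_column = sample["sepal_length"]

# Double brackets pass a list of names and always return a DataFrame,
# even when the list contains only one name.
one_column_frame = sample[["sepal_length"]]
two_columns = sample[["sepal_length", "sepal_width"]]

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
print(two_columns.shape)
```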


We can limit the view to specific rows using the `` command, which can limit what rows we see by the conditional statements we set.
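The sentence above leaves the command unnamed. One common way to filter rows by a condition in Pandas is boolean indexing, and the sketch below assumes that approach; it may not be the one the tutorial originally intended. The DataFrame `sample` is a small stand-in built from rows shown earlier.

```python
import pandas as pd

# Small stand-in DataFrame using four iris rows.
sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 6.2, 5.9],
        "species": ["setosa", "setosa", "virginica", "virginica"],
    }
)

# Comparing a column to a value yields a Series of True/False, one per row.
is_long = sample["sepal_length"] > 5.0

# Passing that Series in brackets keeps only the rows marked True.
long_rows = sample[is_long]

# Conditions can be combined with & (and) or | (or); each condition
# needs its own parentheses.
long_virginica = sample[
    (sample["sepal_length"] > 5.0) & (sample["species"] == "virginica")
]

print(long_rows)
print(long_virginica)
```

Note that the filtered results keep their original index labels, so the rows are not renumbered from 0.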

# Self-Directed Learning: Documentation

Being able to read documentation for libraries you're programming with is an underrated but essential component of being a programmer. It will be much harder for you to use a library effectively if you don't know what it does and don't know how to find out what it does.

Documentation comes in varying levels of quality, but the Pandas documentation, partly as a product of it being such a popular library, is well-made and thorough.

Now that you've learned some of the basics, look at the [Pandas DataFrame documentation](https://pandas.pydata.org/docs/reference/frame.html) and play around more with the `iris` DataFrame. Look at methods we've discussed, such as [`.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head), and make sure that you understand the description of each method. Then look at methods we haven't discussed, such as [`.min()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.min.html), and figure out how to use them on the `iris` DataFrame.
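Documentation is also available without leaving the notebook. As a minimal sketch, the example below uses the standard library's `inspect` module and the method's docstring to look up `.head()`; the same pattern should work for other methods you want to explore.

```python
import inspect

import pandas as pd

# The signature lists the parameters a method accepts and their defaults.
signature = inspect.signature(pd.DataFrame.head)
default_n = signature.parameters["n"].default
print(signature)

# The docstring holds the same text that the documentation site is built from.
summary_line = pd.DataFrame.head.__doc__.strip().splitlines()[0]
print(summary_line)
```

In a notebook, running `help(pd.DataFrame.head)` prints the full docstring.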