# Python: Introduction
Data science is like painting. Your dataset is your canvas, your statistical techniques are your colors, and you use your programming to illustrate your technique and artistry. Data scientists with strong programming skills are able to implement techniques much more quickly and are able to be more creative problem solvers than analysts who struggle to manipulate data on their own.

## How do I learn to program?
True or false: if I read enough textbooks and watch enough videos about programming, I'll become a good programmer. This statement could not be more FALSE. Becoming good at programming is much like becoming good at any spoken language. You will only develop your skills by hands-on experience using the language. This process will be hard at times, and you'll sometimes feel like you're struggling to understand a concept or fix a bug. That's natural and that's often when you learn the most. Here are some tips for success:

1. Always work through all the examples until the end - don't stop and assume that you could figure out the rest.
2. If you get stuck, search for an answer. Every time I code something that I haven't done before, I inevitably end up consulting websites and forums such as Stack Overflow, Stack Exchange, Quora, or other sites that are best found through a general Google search of your topic. A key skill to develop is knowing when to trust a source and when to keep looking.
3. When in doubt, draw it. When you're trying to accomplish something algorithmically, think through and draw out the logic that you're trying to implement before typing in code.

## Why use Python?
There are numerous strengths of this language, which have helped to propel it to becoming one of the most popular languages for data science.

Strengths of Python:
1. Readability. Many consider Python to be easier to read and interpret than other programming languages.
2. It has many useful numeric and scientific programming packages (like NumPy, SciPy, Matplotlib, and scikit-learn) that are not available in other programming languages.
3. Object oriented. From the ground up, Python is object oriented, with all of the primary data types being objects. We'll explore the benefits of this in later modules, but they are considerable.
4. Free. Enough said.
5. Portable. Python works on nearly all systems.
6. Easy to use and easier to learn.
7. Provides the simplicity and ease of use of an interpreted language with the advanced tools that are found in compiled languages (dynamic typing, automatic memory management).
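
Two of the points above, that every value is an object and that typing is dynamic, can be seen in a few lines. This is only a minimal sketch, and the variable name `value` is illustrative:

```python
# Every value in Python is an object, including a plain integer.
value = 6
print(type(value))                # the class this object belongs to
print(isinstance(value, object))  # integers are objects too
print(value.bit_length())         # so they carry their own methods

# Dynamic typing: the same name can later refer to an object of another type.
value = "six"
print(type(value))
print(value.upper())
```

Nothing had to be declared up front: the name `value` simply refers to whatever object was assigned to it most recently.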

Weaknesses of Python:
1. Performance - Python is slower than compiled languages like C.
2. Managing Python packages can be challenging, and confusion exists between code bases written in Python 2 and those written in Python 3.

But what about R or other languages? "R is a programming language developed by statisticians for statisticians; Python was developed by a computer scientist, and it can be used by programmers to apply statistical techniques." ([Sebastian Raschka](https://sebastianraschka.com/blog/2015/why-python.html)) R does statistics extremely well and is a powerful tool for data science. For many applications, either Python or R will get the job done, and both are worth knowing for the aspiring data scientist.

Python, however, consistently pulls ahead in numerous rankings, both within the data science community (particularly the subfield of machine learning) and in the broader programming community. The [Kaggle State of Data Science Survey](https://www.kaggle.com/surveys/2017), put together by the preeminent host of online machine learning competitions, found Python to be the most used programming language among the data science professionals surveyed. Additionally, those same professionals overwhelmingly recommend Python as the language for new data scientists to learn first. The ratings don't stop there, however. Python is an extremely powerful programming language used in many industries outside of data science as well. The [Institute of Electrical and Electronics Engineers (IEEE) ranked Python](https://spectrum.ieee.org/computing/software/the-2017-top-programming-languages) as the #1 programming language in 2017. In the [TIOBE software index for 2018](https://www.tiobe.com/tiobe-index/) (shown below), Python is the number 4 programming language, just after Java, C, and C++, the dominant languages in industry for many years. Just being in the same league as those industry standards demonstrates that Python is a highly transferable skill that can help to propel your career in diverse directions.

<img src="img/tiobe.png" width="800">

## Interpreted languages vs compiled languages
*Section from Charles R. Severance, "Python for Everybody" [Chapter 1](https://www.py4e.com/html3/01-intro)*

Python is a high-level language intended to be relatively straightforward for humans to read and write and for computers to read and process. Other high-level languages include Java, C++, PHP, Ruby, Basic, Perl, JavaScript, and many more. The actual hardware inside the Central Processing Unit (CPU) does not understand any of these high-level languages.

The CPU understands a language we call machine language. Machine language is very simple and frankly very tiresome to write because it is represented all in zeros and ones:

```
001010001110100100101010000001111
11100110000011101010010101101101
...
```

Machine language seems quite simple on the surface, given that there are only zeros and ones, but its syntax is even more complex and far more intricate than Python. So very few programmers ever write machine language. Instead we build various translators to allow programmers to write in high-level languages like Python or JavaScript and these translators convert the programs to machine language for actual execution by the CPU.

Since machine language is tied to the computer hardware, machine language is not portable across different types of hardware. Programs written in high-level languages can be moved between different computers by using a different interpreter on the new machine or recompiling the code to create a machine language version of the program for the new machine.

These programming language translators fall into two general categories: (1) interpreters and (2) compilers.

An interpreter reads the source code of the program as written by the programmer, parses the source code, and interprets the instructions on the fly. Python is an interpreter and when we are running Python interactively, we can type a line of Python (a sentence) and Python processes it immediately and is ready for us to type another line of Python.

Some of the lines of Python tell Python that you want it to remember some value for later. We need to pick a name for that value to be remembered and we can use that symbolic name to retrieve the value later. We use the term variable to refer to the labels we use to refer to this stored data.

```
>>> x = 6
>>> print(x)
6
>>> y = x * 7
>>> print(y)
42
>>>
```

In this example, we ask Python to remember the value six and use the label x so we can retrieve the value later. We verify that Python has actually remembered the value using print. Then we ask Python to retrieve x and multiply it by seven and put the newly computed value in y. Then we ask Python to print out the value currently in y.

Even though we are typing these commands into Python one line at a time, Python is treating them as an ordered sequence of statements with later statements able to retrieve data created in earlier statements. We are writing our first simple paragraph with four sentences in a logical and meaningful order.

It is the nature of an interpreter to be able to have an interactive conversation as shown above. A compiler needs to be handed the entire program in a file, and then it runs a process to translate the high-level source code into machine language and then the compiler puts the resulting machine language into a file for later execution.

If you have a Windows system, often these executable machine language programs have a suffix of ".exe" or ".dll" which stand for "executable" and "dynamic link library" respectively. In Linux and Macintosh, there is no suffix that uniquely marks a file as executable.

If you were to open an executable file in a text editor, it would look completely crazy and be unreadable:

```
^?ELF^A^A^A^@^@^@^@^@^@^@^@^@^B^@^C^@^A^@^@^@\xa0\x82
^D^H4^@^@^@\x90^]^@^@^@^@^@^@4^@ ^@^G^@(^@$^@!^@^F^@
^@^@4^@^@^@4\x80^D^H4\x80^D^H\xe0^@^@^@\xe0^@^@^@^E
^@^@^@^D^@^@^@^C^@^@^@^T^A^@^@^T\x81^D^H^T\x81^D^H^S
^@^@^@^S^@^@^@^D^@^@^@^A^@^@^@^A\^D^HQVhT\x83^D^H\xe8
....
```

It is not easy to read or write machine language, so it is nice that we have interpreters and compilers that allow us to write in high-level languages like Python or C.

Now at this point in our discussion of compilers and interpreters, you should be wondering a bit about the Python interpreter itself. What language is it written in? Is it written in a compiled language? When we type "python", what exactly is happening?

The Python interpreter is written in a high-level language called "C". You can look at the actual source code for the Python interpreter by going to www.python.org and working your way to their source code. So Python is a program itself and it is compiled into machine code. When you installed Python on your computer (or the vendor installed it), you copied a machine-code copy of the translated Python program onto your system. In Windows, the executable machine code for Python itself is likely in a file with a name like:

```
C:\Python35\python.exe
```
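
If you are curious where the interpreter lives on your own machine, the standard library's `sys` module can report it. This is a small sketch; the path and version printed will differ from system to system:

```python
import sys

# Full path of the Python executable that is running this code.
print(sys.executable)

# Version string of that interpreter.
print(sys.version)
```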

That is more than you really need to know to be a Python programmer, but sometimes it pays to answer those little nagging questions right at the beginning.

## Writing programs
*Section adapted from Charles R. Severance, "Python for Everybody" [Chapter 1](https://www.py4e.com/html3/01-intro)*

Typing commands into the Python interpreter is a great way to experiment with Python's features, but it is not recommended for solving more complex problems.

When we want to write a program, we use a text editor to write the Python instructions into a file, which is called a script. By convention, Python scripts have names that end with .py.

Say you've created a text file called `hello.py` with the following contents:

```
print('Hello world!')
```

To execute the script, you have to tell the Python interpreter the name of the file. In a Unix terminal (the Linux and macOS terminals use similar commands) or a Windows command window, you would type `python hello.py` as follows:

```
python hello.py
```

We call the Python interpreter and tell it to read its source code from the file "hello.py" instead of prompting us for lines of Python code interactively.

You will notice that there was no need to have quit() at the end of the Python program in the file. When Python is reading your source code from a file, it knows to stop when it reaches the end of the file.
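
As a slightly longer sketch, the statements from the earlier interactive session could be saved in a script (the file name `answer.py` is hypothetical) and run with `python answer.py`. Python executes the statements from top to bottom and stops when it reaches the end of the file:

```python
# answer.py (hypothetical file name)
x = 6          # remember the value 6 under the name x
print(x)
y = x * 7      # retrieve x, multiply it by 7, and store the result in y
print(y)
# No quit() is needed: the script ends when Python reaches the end of the file.
```

Unlike the interactive session, a script only shows what you explicitly `print`, so both `print` calls are needed to see the values.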

## What is a program?
The definition of a program at its most basic is a sequence of Python statements that have been crafted to do something. Even our simple hello.py script is a program. It is a one-line program and is not particularly useful, but in the strictest definition, it is a Python program.

It might be easiest to understand what a program is by thinking about a problem that a program might be built to solve, and then looking at a program that would solve that problem.

Let's say you are doing Social Computing research on Facebook posts and you are interested in the most frequently used word in a series of posts. You could print out the stream of Facebook posts and pore over the text looking for the most common word, but that would take a long time and be very error-prone. You would be smart to write a Python program to handle the task quickly and accurately so you can spend the weekend doing something fun.

For example, look at the following text about a clown and a car. Look at the text and figure out the most common word and how many times it occurs.

```
the clown ran after the car and the car ran into the tent
and the tent fell down on the clown and the car
```

Then imagine that you are doing this task looking at millions of lines of text. Frankly it would be quicker for you to learn Python and write a Python program to count the words than it would be to manually scan the words.

The even better news is that I already came up with a simple program to find the most common word in a text file. I wrote it, tested it, and now I am giving it to you to use so you can save some time.

```
name = input('Enter file:')
handle = open(name, 'r')
counts = dict()

for line in handle:
    words = line.split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1

bigcount = None
bigword = None
for word, count in list(counts.items()):
    if bigcount is None or count > bigcount:
        bigword = word
        bigcount = count

print(bigword, bigcount)

# Code: http://www.py4e.com/code3/words.py
```

You will need to get through Chapter 10 of "Python for Everybody" to fully understand the awesome Python techniques that were used to make this program, but the cool thing is that you don't even need to know Python to be able to use this program. As the end user, you can simply use the program by typing the code into a file called `words.py` and running it, or by downloading the source code from http://www.py4e.com/code3/ and running it.

This is a good example of how Python and the Python language are acting as an intermediary between you (the end user) and me (the programmer). Python is a way for us to exchange useful instruction sequences (i.e., programs) in a common language that can be used by anyone who installs Python on their computer. So neither of us is talking to Python; instead, we are communicating with each other through Python.
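
To try the same counting idea on the clown text without creating a file first, here is a minimal sketch using `collections.Counter` from the standard library. This is a swapped-in shortcut, not the approach used in `words.py`, and the variable name `text` is illustrative:

```python
from collections import Counter

text = """the clown ran after the car and the car ran into the tent
and the tent fell down on the clown and the car"""

# split() breaks the text on whitespace, including the line break.
counts = Counter(text.split())

# most_common(1) returns a list holding the single (word, count) pair
# with the highest count.
bigword, bigcount = counts.most_common(1)[0]
print(bigword, bigcount)  # the 7
```

`Counter` does the same bookkeeping as the `counts.get(word, 0) + 1` line in the longer program, which is why the two versions agree on the answer.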

## Understanding the Python ecosystem
Tools for using Python:
- Command line
- Shell
- IDLE
- IPython
- Jupyter Notebook
- Spyder
- Any text editor (Sublime, Vim, Emacs, etc.)

Packages to learn:
- Built-in Python
- NumPy
- Matplotlib
- Pandas

## Using the terminal
- Command prompt basics: opening the prompt (terminal or cmder)
- Command line operations (`cd`, `ls`, `mkdir`, `rm`)
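
The same file operations are also available from inside Python, which can help connect the shell commands to code. The following is a hedged sketch using `pathlib`, `os`, and `tempfile` from the standard library. It works in a throwaway temporary directory, and the names `demo` and `notes.txt` are hypothetical:

```python
import os
import tempfile
from pathlib import Path

# Work inside a temporary directory so nothing on your system is touched.
with tempfile.TemporaryDirectory() as tmp:
    start = Path.cwd()
    os.chdir(tmp)                                  # like: cd <tmp>

    Path("demo").mkdir()                           # like: mkdir demo
    Path("demo/notes.txt").write_text("hello\n")   # create a small file

    listing = sorted(p.name for p in Path("demo").iterdir())  # like: ls demo
    print(listing)

    Path("demo/notes.txt").unlink()                # like: rm demo/notes.txt
    remaining = list(Path("demo").iterdir())
    print(remaining)

    os.chdir(start)   # go back before the temporary directory is deleted
```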

## First steps on the Python interpreter
*Section adapted from Charles R. Severance, "Python for Everybody" [Chapter 1](https://www.py4e.com/html3/01-intro)*

Open up a terminal, type `python`, and the fun begins:

```
Python 3.5.1 (v3.5.1:37a07cee5969, Dec  6 2015, 01:54:25)
[MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>
```

The >>> prompt is the Python interpreter's way of asking you, "What do you want me to do next?" Python is ready to have a conversation with you. All you have to know is how to speak the Python language.

Let's say for example that you did not know even the simplest Python language words or sentences. You might want to use the standard line that astronauts use when they land on a faraway planet and try to speak with the inhabitants of the planet:

```
>>> I come in peace, please take me to your leader
  File "<stdin>", line 1
    I come in peace, please take me to your leader
         ^
SyntaxError: invalid syntax
>>>
```

Of course, this isn't valid Python syntax. But don't worry about the error - it doesn't do any harm. Let's try something that works on the Python interpreter (valid Python syntax):

```
>>> print('Hello world!')
Hello world!
```
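
The same two outcomes can be reproduced from a script. In this sketch, the built-in `compile()` function checks whether a line is valid Python without running it; the helper name `is_valid_python` is hypothetical:

```python
def is_valid_python(source):
    """Return True if source parses as Python, False on a SyntaxError."""
    try:
        compile(source, "<stdin>", "exec")
        return True
    except SyntaxError:
        return False


print(is_valid_python("I come in peace, please take me to your leader"))  # False
print(is_valid_python("print('Hello world!')"))                           # True
```

Catching the `SyntaxError` shows that the error is just information about the input, and the program keeps running afterwards.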

To leave the Python interpreter, you can type:
```
>>> quit()
```

# Exercises