# Demystifying Data Tools and Technologies

<img src="data/images/Data Tools.gif" width=1000 height=800>

## Goals

By the end of the case, you will have a high-level understanding of data pipelines and the tools used to build them. Some key concepts you will become familiar with include:
- data ingestion
- data aggregation
- data cleaning
- data analysis
- data visualization
- structured vs. unstructured vs. semi-structured data
- databases
- scripting
- on-premises vs. cloud infrastructure

## Introduction

Data science is nowadays one of the most recognized fields in the world. It's almost impossible to avoid stumbling upon the term itself in day-to-day parlance in the tech world, and literally impossible to evade the myriad of products derived from the field. Everything from your car to your lipstick, from your laptop to the marketing that persuaded you to buy it, is now carefully optimized through the generous study of data.

But this wasn't always the case. Data science only recently came to the forefront, due in large part to the incredible amount of information that we now generate on computer systems every second and to our obsession with cataloging and classifying absolutely all minute details about everyone and everything. Today, we produce about 2.5 quintillion (2,500,000,000,000,000,000) bytes of data daily.



As a frame of reference, many laptop computers can only hold 512 gigabytes (512,000,000,000 bytes) of storage in total!

Though humankind has always been obsessed with keeping records of everything, only recently have we begun to keep all the data in an organized manner and in a digital format that allows us to analyze it together. While it's possible our parents kept a wall with marks of our height as we were growing up, it's unlikely they added it to a spreadsheet program, sent it over to a website, and had it combined with the height information of more children out in the world. At best, a few lucky doctors and scientists would keep records of a few thousand children.

But with the advent of mobile phones and apps, there are millions and millions of parents doing just that every day while using their favorite app to keep track of their children's stats. And with the Internet and the omnipresence of sensors in everything we own, we can now keep almost real-time tabs on everything, from car speeds to food choices, from sports team results to plant growth per day.

## The data disciplines and pipeline

The professional that manages to find, cook, garnish, and present this information is the **data scientist**. This person combines the knowledge of 3 disciplines in their daily work - Computer Science, Math & Statistics (mainly statistics), and Substantive Expertise / Domain Knowledge.

<img src="data/images/ds_venn_diagram.jpg">


From Math and Statistics, we need the skill of *modeling*, which is the art of representing the world with a mathematical approximation. We need to know how different statistical models work internally, so that we can use each one where it fits best. Lastly, assorted mathematical knowledge is sometimes needed or very helpful both for creative problem solving and for analysis of results.

From Computer Science we need the very important skill of *coding* or *programming*. We can take our math and statistics knowledge and program it with code in order to be able to give life to our models and get results we can analyze and act upon.

And finally, we need Domain Knowledge in order to guide our mathematical modeling of the world, to inspire our creative problem-solving for the task at hand, and to be able to correctly recommend actions based on our results.

## Every project begins with a question!

What will this stock be worth tomorrow? What brand of cars appeals more to 18-year-olds? Why is my website traffic plummeting? A question or set of questions allows us to frame each part of a project.

After we have decided we want to answer some question, we will need to build a **data pipeline** that allows us to get answers. A full data pipeline usually requires some combination of the following:

- **Procurement**: In order to use data, we must first find it. This may be as simple as buying it from an existing source, or as complex as creating a whole economic endeavor that mines data from user behavior. Some business problems can be solved with a single data source, but many others need data from many different sources. This uses a lot of coding and domain knowledge.

- **Aggregation**: Once we have acquired our data, we must aggregate it. This usually entails creating a process by which we create more focused sets of data from our original information. One example you may be familiar with is getting the grades of a student and their attendance record together in one single spreadsheet. Sometimes this entails very complex processes such as natural language processing, and other times it may be as simple as joining two datasets on a shared ID. For this we need mainly coding.

- **Cleaning**: We must also clean all this data. Sensors are not perfect, human-inputted values are many times erroneous, flukes of luck and chance affect some outcomes. Whatever the reason, in the real world it is extremely rare to find a dataset without some manner of outliers, data which is suspect. Sometimes before or sometimes after aggregation, we must take care to clean our data of such oddities in order to make sure our conclusions are sound. This uses coding, statistics and domain knowledge in equal measures.

- **Analysis**: After we have orderly and clean data, we can analyze it. This means using a complete assortment of tools in order to extract meaning and actionable insights from our information. While computers help, this is usually the one part of the process on which humans are still unbeatable. This is done with domain knowledge for the most part, with a bit of statistics and coding sprinkled on top.

- **Modeling**: Using our insights and domain knowledge, we can now proceed to create models which allow us to predict future states of the world given the information we've gathered about past states of the world. Models may be as simple as Linear Regressions, or as complex as Ensemble Neural networks. In the end, you may think of every model as a box into which you input an incomplete set of information about the state of a bit of the world (be that the real estate market or the sales projections for the next year) and which outputs a prediction about the present or future of that little piece of existence. All 3 fields of knowledge are mandatory here.

- **Presentation**: Finally, we must explain our process, insights and findings. This is often the most underrated skill in novice data scientists, but in many an elder and wise practitioner's mind is the most important of all. After all, our insights and models will not be very actionable if we can't explain them and convince others to act on them. This includes the verbal, written and graphical representation of what we want to convey. Hence, it is advisable that if one is to learn to be great at something in data, that it is the presentation of it.  Domain knowledge is our main asset.
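To make the cleaning stage described above a little more concrete, here is a minimal sketch of one common approach: discarding values that fall far outside the interquartile range (IQR). The `drop_outliers` helper, the `1.5` multiplier, and the sample heights are all assumptions chosen for illustration. In a real project, domain knowledge should decide what counts as suspect.

```python
import statistics


def drop_outliers(values, k=1.5):
    """Keep only values within k * IQR of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical heights in cm; 1720 is most likely a typo for 172
heights_cm = [168, 172, 165, 1720, 170, 174, 169]
clean_heights = drop_outliers(heights_cm)
print(clean_heights)
```

Note that the rule only flags the suspicious value. Whether to delete it, correct it, or keep it is still a human decision.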

## The modern data pipeline

> "A data pipeline is a set of actions that ingest raw data from disparate sources and move the data to a destination for storage and analysis. A pipeline also may include filtering and features that provide resiliency against failure." 
>
> Source: [What is a data pipeline?](https://www.stitchdata.com/resources/what-is-data-pipeline/)

<br>

Starting with the simple definition above, we see that the concept of a "pipeline" consists of at least three major phases:

* the **ingestion** of the data from its source
* the **processing** of the data, including multiple cleaning and transformation steps
* the **output** or destination of the data (including the storage and analysis)

Each of these phases can consist of multiple steps depending on the format of the data, business requirements, etc. For example, if your team needs to analyze data coming from multiple sources, then there need to be extra processing steps that transform the data sources into a standardized format that allows them to be joined on common fields. Or, if your data pipeline is designed to serve a large dashboard to be used by many different stakeholders, then your processed data will likely need to be loaded into a database at the end of the pipeline that is [optimized for analytics](https://searchbusinessanalytics.techtarget.com/definition/analytic-database).
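The three phases can be sketched as three small Python functions chained together. This is only a toy illustration using Python's standard library: the `RAW_CSV` sample, the function names, and the per-store total are assumptions made up for this example, not part of any particular pipeline tool.

```python
import csv
import io
import json

# Hypothetical raw export from a source system (one row has a missing amount)
RAW_CSV = """store,item,amount
north,guitar,1119.00
north,strings,
south,guitar,1250.50
south,strings,17.75
"""


def ingest(raw_text):
    """Ingestion: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def process(rows):
    """Processing: drop incomplete rows and total the amount per store."""
    totals = {}
    for row in rows:
        if not row["amount"]:
            continue  # cleaning step: skip rows with a missing amount
        totals[row["store"]] = totals.get(row["store"], 0.0) + float(row["amount"])
    return totals


def output(totals):
    """Output: serialize the result for storage or a dashboard."""
    return json.dumps(totals, sort_keys=True)


report = output(process(ingest(RAW_CSV)))
print(report)
```

Real pipelines replace each function with something much larger (an API client, a cluster of transformation jobs, a database load), but the shape stays the same.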

In [1]:
from IPython.display import IFrame

IFrame('https://www.youtube.com/embed/oKixNpz6jNo', width=560, height=315)

## Starting with the raw data

Before we can begin to understand how data moves through a data pipeline, we must first get a better understanding of the data itself. A large percentage of the data ingested by data pipelines is considered to be **raw data**, which refers to data that is in its initial state when it is pulled from its source. Raw data has not been processed, so it is likely not in a state where it can be the most useful to your company. Raw data can be information that's recorded, typed, outputted by a machine, or measured by a sensor. Raw data can also be found in a variety of places including databases, files, spreadsheets, audio devices, and everywhere information is exchanged. Some examples of raw data include:

* A list of every item purchased at a store
* A page of written lab notes from a scientific experiment
* A collection of time-lapse pictures taken of the sky over a period of time

You might have heard that data is the "new oil" for organizations operating in the current Information Age. Much like how crude oil doesn't provide much benefit in its raw form, it is through the process of cleaning and refining data that it is transformed into a product which is useful to the end user.

While raw data refers to the state that the data is in, there are a number of different structures and formats that the data can take on including:

- **Structured data.** This refers to data that comes in a tabular format, where there is a relationship between rows and columns. Structured data often conforms to a pre-defined **data model**, which makes analysis on pre-defined fields easier. Some examples of structured data include spreadsheets, SQL databases, and character-delimited text files such as CSV (comma-separated values) and TSV (tab-separated values) files.
- **Unstructured data.** This refers to data that doesn't conform to a particular format or data model, and is typically a lot harder to analyze in its raw form. Examples of unstructured data include written or typed transcripts, log files, images, and even audio and video files.
- **Semi-structured data.** This refers to data that does not conform to the tabular structure and data models of structured data, but does contain other attributes that identify different data points. Examples of semi-structured data include JSON and XML files.
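As a rough illustration of the three forms described above, the sketch below holds similar information about one customer in each of them and reads it with Python's standard library. The sample values and field names are hypothetical.

```python
import csv
import io
import json

# Structured: fixed columns, one record per row
structured = "customer_id,name,city\n1,Leela,New York\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing keys, nesting allowed, fields may vary per record
semi_structured = '{"customer_id": 1, "name": "Leela", "orders": [{"sku": "A1"}]}'
record = json.loads(semi_structured)

# Unstructured: free text with no field markers at all
unstructured = "Leela from New York called about her order of item A1."

print(rows[0]["city"])              # fields are addressed by column name
print(record["orders"][0]["sku"])   # fields are addressed by key path
print("New York" in unstructured)   # without extra processing, only text search works
```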

### Exercise 1

We've just learned about raw data and the three main forms it can be found in:

* structured
* unstructured
* semi-structured

In your current directory, there is a `data/` folder containing 6 text files with different pieces of information. Your task is to examine each text file and determine which of the three formats above best describes the data they contain.

**Answer.** customers.txt and products.txt are structured. web-app.txt and web-app-2.txt are semi-structured. tennis.txt contains both structured and semi-structured data. alice.txt contains completely unstructured data.

-------

## Databases

<img src="data/images/logo_dbs.png">

All this information must go somewhere! Once you have procured your data, it must be stored somewhere safe. The same goes for the result of all your transformations. For this purpose, we have **databases**.

Databases are organized collections of information that represent some amount of data. They are usually in the form of **tables** of information, like a spreadsheet, where each row in the spreadsheet is one instance of some part of our data, and each column is a characteristic for that instance. Each database can have many of these tables. Tables are usually linked to each other as well, in what is known as a **relational database**, and it is by far the most commonplace structure of all.

### SQL and PostgreSQL

In order to interact with databases, we have the **SQL (Structured Query Language)** programming language. After Python, SQL is the second most important programming language in the field. It allows us to interact directly with a database, by using a series of commands to insert, delete, update, or query information. Almost every relational database can be interacted with through SQL:

```sql
SELECT * FROM students
WHERE Name = 'Leela';
```

There are many variations on the architecture and internal functioning of databases. These variations are known as **Relational Database Management Systems**, or RDBMS for short. You may have heard about some of them, such as MySQL, Oracle, SQL Server and SQLite. Among those, of special note is PostgreSQL. PostgreSQL is free, open-source and extremely powerful, which makes it a very attractive option for new projects. It is the database system used by companies such as Instagram, Twitch, and Skype. It even has a very nice graphical interface which allows you to query and observe your data with little to no coding experience, and is one of the top options to get your feet wet with databases in general. 
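To see what "tables linked to each other" looks like in practice, here is a small sketch using SQLite through Python's built-in `sqlite3` module. The `agent` and `calls` tables and their contents are invented for this illustration; the point is only that a shared `AgentID` column lets a query combine rows from both tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE agent (AgentID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE calls (
        CallID INTEGER PRIMARY KEY,
        AgentID INTEGER REFERENCES agent(AgentID),
        DurationSeconds INTEGER
    );
    INSERT INTO agent VALUES (1, 'Jocelyn Parker'), (2, 'Christopher Moreno');
    INSERT INTO calls VALUES (10, 1, 120), (11, 1, 300), (12, 2, 45);
""")

# The shared AgentID column is what links the two tables
calls_per_agent = conn.execute("""
    SELECT agent.Name, COUNT(*) AS Calls, SUM(calls.DurationSeconds) AS TotalSeconds
    FROM agent
    JOIN calls ON calls.AgentID = agent.AgentID
    GROUP BY agent.AgentID
    ORDER BY agent.AgentID
""").fetchall()
print(calls_per_agent)
conn.close()
```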

### Example 1

Run the following cell to load our database into the notebook:

**Note:** Don't worry about learning this code, it isn't SQL!

In [2]:
%%capture
!pip install ipython-sql sqlalchemy
import sqlalchemy
sqlalchemy.create_engine("sqlite:///call_center_database2.db")
%load_ext sql
%sql sqlite:///call_center_database2.db

In [3]:
%%sql

SELECT *
FROM agent
WHERE AgentID > 2

 * sqlite:///call_center_database2.db
Done.


AgentID,Name
3,Todd Morrow
4,Randy Moore
5,Paul Nunez
6,Gloria Singh
7,Angel Briggs
8,Lisa Cordova
9,Dana Hardy


### Exercise 2

Try modifying the SQL query below to find all of the agents in the `agent` table whose names start with the letter "A".

In [4]:
%%sql

SELECT *
FROM agent

 * sqlite:///call_center_database2.db
Done.


AgentID,Name
0,Michele Williams
1,Jocelyn Parker
2,Christopher Moreno
3,Todd Morrow
4,Randy Moore
5,Paul Nunez
6,Gloria Singh
7,Angel Briggs
8,Lisa Cordova
9,Dana Hardy


**Answer.** SELECT *

FROM agent

WHERE Name LIKE 'A%'

-------

SQL is a powerful tool that allows data professionals to pull data directly from its source. While writing and running one-off SQL queries can be great for creating manual reports, that process is far from ideal in situations where the data has to be queried on a regular basis. This is where **scripting** comes into play: a script is a series of computer commands that can be executed one after another with no human interaction. Many data-fueled applications use a series of scripts to pull data from a database or other source and feed it into the data pipeline.
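As a minimal sketch of what such a script might look like, the snippet below runs a query and writes the result as CSV without anyone typing the query by hand. The `export_query` helper and the in-memory stand-in database are assumptions made for this example. A real script would connect to your actual database and would typically be run on a schedule.

```python
import csv
import io
import sqlite3


def export_query(conn, query, out_file):
    """Run a query and write the result, with a header row, as CSV."""
    cursor = conn.execute(query)
    writer = csv.writer(out_file)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)


# Hypothetical stand-in for a real database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agent (AgentID INTEGER, Name TEXT)")
conn.executemany(
    "INSERT INTO agent VALUES (?, ?)",
    [(7, "Angel Briggs"), (8, "Lisa Cordova")],
)

buffer = io.StringIO()  # a real script would open a file on disk instead
export_query(conn, "SELECT AgentID, Name FROM agent WHERE Name LIKE 'A%'", buffer)
print(buffer.getvalue())
conn.close()
```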

## Programming Languages

In order to program something for our computer to do, we must use a language that both we and the computer can understand. This common language is known as a **programming language**, and we have already talked about how SQL is one such language. A programming language lets you write instructions that are readable by a human like you or me, and these are then translated into instructions readable by the computer, commonly known as machine code.

Programming languages have some things in common with languages like English or Spanish. For example (and this is very important!), they are, or really should be, understandable by humans. Also, just as with English or Spanish, they have rules about how they should be used.

“I have a wife named Natalia” is a very different phrase from “I have a Natalia named wife”, even when all the words are the same. The same thing happens with programming languages - they have rules about the order in which things should be written. They also have punctuation marks such as commas, dots, parentheses, etc!

One difference is that programming languages are much stricter about these rules. If the rules of a programming language are not followed exactly and perfectly, then your program won't compile or run, which means that the computer was not able to translate your instructions from your programming language into machine code. On the plus side, this strictness makes it easier for different people to read code and makes it unambiguous as to what the code does.
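You can watch this strictness in action with Python itself. The sketch below uses Python's built-in `compile()` function to check whether a piece of text follows the language's rules. The `is_valid_python` helper is a name made up for this example. The two snippets differ by a single missing parenthesis.

```python
def is_valid_python(source):
    """Return True if the text obeys Python's syntax rules, False otherwise."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


print(is_valid_python("total = (1 + 2) * 3"))  # parentheses are balanced
print(is_valid_python("total = (1 + 2 * 3"))   # closing parenthesis is missing
```

A human reader could probably guess what the second line was meant to say. The computer refuses to guess.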

<img src="data/images/programming_languages.jpeg">

Source: [Statista](https://www.statista.com/chart/16567/popular-programming-languages/)

The programming language of choice for data science today is **Python**. (There are others such as R, Scala or Julia, but they lag behind both in market share and in the amount of functionality they offer for data science work.) Python is great for 3 important reasons:

1. It is easy to learn
2. It is easy to read. Many other programming languages are not!

These 2 characteristics have made it great for applications where programming is not the focus of the work, but rather a means to an end (just like data science)! And finally, our third:

3. Python is **open-source** software

Open-source means that something is free to use in its entirety and furthermore, that other people in the Python community constantly add more functionality to the language for free! It also means that the code behind these functionalities is public and editable by anyone, so if someone feels something can improve, they can go ahead and improve it for the well-being of everyone.

When you are going to code a program, you almost never start from scratch - you usually use code that was built by someone else before you that does a certain thing. For example, if I want to write a program that analyzes data and needs to find the average or mode or other common statistics of a set of numbers somewhere, I don't need to write the function that finds the mode myself! Since it is a common problem, it's likely that someone else already created such a function. I can then download that functionality from the Internet, import it into my program, and use it without much hassle. 

Such functionalities written by other people and usable by anyone in a language are called **libraries** or **packages**. Python has over 200,000 packages, many of which are crucial to data science. For example, there's the `numpy` library, which has functionalities for many basic and not so basic mathematical operations and scientific computation. You want something for interacting with Excel files? You can use `OpenPyXL`. Images? You have `Pillow`. There is something for everything.
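As a small illustration of reusing code that someone else already wrote, the sketch below finds the average, mode, and median of a list of numbers with the `statistics` module, which ships with Python. The exam scores are hypothetical.

```python
import statistics  # ships with Python, so there is nothing to install

# Hypothetical exam scores
scores = [70, 85, 85, 90, 100]

average = statistics.mean(scores)
most_common = statistics.mode(scores)
middle = statistics.median(scores)
print(average, most_common, middle)
```

Third-party packages such as `numpy` are used the same way: install them once, then `import` them wherever they are needed.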

### The IDE

An IDE (Integrated Development Environment) is the main tool you'll be looking at when writing code. Truth is, code files are nothing more than text files, and could be written in your Notepad application, Microsoft Word, or any other text editor. But code is not pleasant to read in these text editors, and none of them provide helpful functionality like checking if you wrote a word in the wrong place or allowing you to run the code immediately as you write it.

<img src="data/images/ides.png">

IDEs are text editors on steroids with functionalities that help in the coding process. There are many to choose from, from very minimalistic, bare-bones ones such as Vim, to others chock-full of functionality such as VS Code. One of the most used ones for Python and data science work is Jupyter because it does something almost no other IDE allows - it lets you run code *alongside* text. Like combining a Word document with a code file!

<img src="data/images/logo_jupyter.png">

This ability is what makes Jupyter very special for prototyping projects, testing solutions, and creating deliverables that include not only code but also graphs, text, and explanations. Remember when we said that data science needed you to not only code, but also to have domain understanding?

One of the most important parts of data science (if not the most important) is explaining your results to others, and Jupyter makes this very easy. It does this through a file format called a **Jupyter Notebook**. In a notebook, you have the freedom of a program such as Microsoft Word combined with the power of the Python programming language. If you work on or interact with a data science team, you are virtually certain to see a Jupyter Notebook sooner rather than later.

## Data Manipulation Tools

Once you have Python and your trusty IDE all set up, it's time to start writing code and manipulating some data. This process involves taking some set of data and finding out interesting things about it. If you had a database with ratings and cast information for all TV shows since the 60s, some interesting questions might be:

- What is the average age of actors starring in shows during each year?
- Who was the youngest starring actor each decade?
- Which actress had the most screen time per year?
- Is screen time correlated with ratings?
- Are there any performers who played in more than 3 genres of shows over their lifetime?

Most people would try to find answers to these questions with Excel, using some long-winded formula or series of pivot tables to find answers to each question. But the results would be hard to use elsewhere. What if you wanted to use the result of your pivot table to build a new table that is joined with another data source? What if you wanted to build an automated process that could find such answers for any similar dataset?

While Excel is an acceptable tool for simple questions, once your needs become more complex you need something more powerful. For these cases, we have a super-hero - `pandas`:

<img src="data/images/logo_pandas.png" width=800 height=500>

`pandas` is a Python package that offers functionality similar to Excel's, but via Python code. With `pandas`, operations that would take hours of formulas in Excel can be written in only 1 or 2 lines of code. Furthermore, the code runs *much* faster in `pandas`, and it is much easier to automate and to export results for future use.

`pandas`, at its core, is just a rows-by-columns table representation system, similar to an Excel spreadsheet! Once you have a table of values loaded into it, you can create pivot tables, fill in missing values, find averages, medians, sums, or any other similar statistic, and filter, rank, or aggregate the data. One great thing about it is that you can work with more than just numbers and text - a `pandas` **DataFrame**, as its representation of a table is called, can hold anything. It can hold images, sound, video, even things like functions, abstract values, or logic. This allows it to be the perfect tool for working with many types of datasets, with as many columns and rows as your computer can hold (instead of the limits that Excel has). If Python is *the* language of the data science world, `pandas` is at the heart of its culture.
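Two of the operations mentioned above, filling in missing values and building a pivot table, can be sketched in a few lines. The `sales` table below is hypothetical data invented for this illustration, and filling a gap with the median is just one of several reasonable choices.

```python
import pandas as pd

# Hypothetical sales records; one price is missing
sales = pd.DataFrame({
    "product_line": ["Guitars", "Guitars", "Brass", "Brass"],
    "region": ["North", "South", "North", "South"],
    "unit_price": [1000.0, None, 1200.0, 1300.0],
})

# Fill the missing price with the median of the known prices
sales["unit_price"] = sales["unit_price"].fillna(sales["unit_price"].median())

# Pivot: one row per product line, one column per region
pivot = sales.pivot_table(
    index="product_line", columns="region", values="unit_price", aggfunc="mean"
)
print(pivot)
```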

### Example 2

In this example, we will take one of our raw data files, `products.txt`, from earlier in the case and write a series of data cleaning steps in Python by using the `pandas` package. The data is considered raw because each row corresponds to an individual product.

Because packages are pieces of code that have already been written for us, all we have to do in order to use that code in our project is **import** it into our script.

In [5]:
#This green text is called a "comment" in our code. It is purely for annotation/informational purposes, and does not contain
#instructions that will be run in the code

#import numpy and pandas packages
import numpy as np
import pandas as pd

#use the `read_csv()` function from the pandas package to read the contents of the products.txt file
products_table = pd.read_csv("data/products.txt", sep="\t")

#print the contents of the products table
display(products_table)

Unnamed: 0.1,Unnamed: 0,PRODUCT_KEY,PRODUCT_LINE,PRODUCT_TYPE,DESCRIPTION,COST,UNIT_PRICE
0,0,441354,Musical instruments,Digital pianos,Casio Celviano AP270,776.26,1049
1,1,460568,Musical instruments,Digital pianos,Casio Celviano AP470,1139.24,1499
2,2,451845,Musical instruments,Digital pianos,Casio Celviano AP650,1696.50,2175
3,3,270105,Musical instruments,Digital pianos,Casio Celviano AP700,1874.25,2499
4,4,339695,Musical instruments,Digital pianos,Yamaha CLP 785,4118.00,5800
...,...,...,...,...,...,...,...
65,65,246596,Accesories,Pedals and amps,Roland Cube Street EX Amp,389.79,549
66,66,189645,Accesories,Sheet Music,Beethoven piano sonatas (Sheet Music),11.85,15
67,67,284885,Accesories,Sheet Music,Bach’s Well Tempered Clavier (Sheet Music),10.80,15
68,68,291118,Accesories,Sheet Music,Hans Zimmer cinematic orchestrations (Sheet Mu...,79.20,99


As we take a look at the raw data in the `products.txt` file, we see that there is a weird column called `Unnamed: 0` in the beginning of the table. We won't worry about that for now, so we will exclude the column from our table by selecting every column **but** the one in question. We also don't want to keep the `PRODUCT_KEY` column, so we will exclude that one as well:

In [6]:
products_table = products_table.loc[:, "PRODUCT_LINE":"UNIT_PRICE"]

display(products_table)

Unnamed: 0,PRODUCT_LINE,PRODUCT_TYPE,DESCRIPTION,COST,UNIT_PRICE
0,Musical instruments,Digital pianos,Casio Celviano AP270,776.26,1049
1,Musical instruments,Digital pianos,Casio Celviano AP470,1139.24,1499
2,Musical instruments,Digital pianos,Casio Celviano AP650,1696.50,2175
3,Musical instruments,Digital pianos,Casio Celviano AP700,1874.25,2499
4,Musical instruments,Digital pianos,Yamaha CLP 785,4118.00,5800
...,...,...,...,...,...
65,Accesories,Pedals and amps,Roland Cube Street EX Amp,389.79,549
66,Accesories,Sheet Music,Beethoven piano sonatas (Sheet Music),11.85,15
67,Accesories,Sheet Music,Bach’s Well Tempered Clavier (Sheet Music),10.80,15
68,Accesories,Sheet Music,Hans Zimmer cinematic orchestrations (Sheet Mu...,79.20,99


Now that our table looks relatively clean, our next task is to group the products by product line and product type in order to find the average cost and average unit price across all subcategories.

In [7]:
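# Average only the numeric columns; recent pandas versions raise an error
# if asked to average a text column such as DESCRIPTION
average_products = products_table.groupby(["PRODUCT_LINE", "PRODUCT_TYPE"])[["COST", "UNIT_PRICE"]].mean()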
average_products.columns = ["AVERAGE_COST", "AVERAGE_UNIT_PRICE"]
average_products

Unnamed: 0_level_0,Unnamed: 1_level_0,AVERAGE_COST,AVERAGE_UNIT_PRICE
PRODUCT_LINE,PRODUCT_TYPE,Unnamed: 2_level_1,Unnamed: 3_level_1
Accesories,Pedals and amps,188.896667,255.166667
Accesories,Sheet Music,26.4,33.5
Accesories,Strings,13.715,17.75
Musical instruments,Acoustic pianos,87993.333333,110066.666667
Musical instruments,Brass,928.043333,1257.333333
Musical instruments,Digital pianos,3881.427,5112.4
Musical instruments,Guitars,841.16875,1119.375
Musical instruments,Percussion,509.18,699.0
Musical instruments,Strings,1276.688333,1707.333333
Musical instruments,Synths,1438.471667,1885.5


Now that we have our data cleaned and aggregated according to our specifications, we can export it to a cleaned data file to be used for later storage, reporting, or analysis.

In [None]:
average_products.to_csv("data/products_clean.csv")

### Modeling

Now that we have procured, transformed, and saved our information in our trusty database, we are ready to model the world with it. **Modeling** is the art of creating predictions for the future or the present out of observations from the past. The way in which we model the world in data science is very much rooted in statistics - everything from the simplest models to the most complex (e.g. neural networks) is just applied statistics at its core.
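To give a feel for what "creating predictions out of observations from the past" means, here is a minimal sketch of the simplest model there is: a straight line fitted by ordinary least squares, written in plain Python. The `fit_line` helper and the study-hours data are assumptions made up for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical past observations: hours studied vs. exam score
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

intercept, slope = fit_line(hours, scores)
predicted = intercept + slope * 6  # prediction for an input we have not seen
print(intercept, slope, predicted)
```

Real data is never this tidy, which is why the fitted line is an approximation of the world rather than a description of it.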

One advantage we have as data scientists is that most models have been generalized and coded for us by previous geniuses in the field. This allows almost anyone to apply complex models such as Gradient-Boosted Trees, Neural Networks, Hierarchical Clustering Algorithms, and many others with only a couple lines of code.

#### `scikit-learn`

The Swiss Army Knife of modeling is a Python package called `scikit-learn`. `scikit-learn` allows us to run many types of classification, regression, and clustering models with only a couple of lines of Python code.

However, in order to use it you must know Python (and `pandas`) as well. Linear Regression? Logistic Regression? Decision Trees? Ensembles? `scikit-learn` has it all. Most of its models are part of the bigger field of **machine learning**, in which you input tons and tons of data into your computer in order to get as accurate a model as possible.

A nice thing about `scikit-learn` is its ease of use, which allows even people who have no idea what a model does internally to test it out anyways. Many aspiring data scientists were inspired to learn the math behind the models after seeing the results of them first-hand. This allows even non-math inclined people to dip their toes in machine learning without much time commitment.

### Example 3

In this example, we'll build a simple Linear Regression model using `scikit-learn`.

In [None]:
# Load the wine dataset that ships with scikit-learn
from sklearn.datasets import load_wine

data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head()

In [None]:
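from sklearn.model_selection import train_test_split

# Features: every column except the last; target: the last column (proline).
# The target must not also appear among the features.
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values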


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

print(regressor.intercept_)
print(regressor.coef_)

y_pred = regressor.predict(X_test)

results = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
results.head()

If you'd like to see more, you can continue with this example [here](https://stackabuse.com/linear-regression-in-python-with-scikit-learn/).

#### TensorFlow and PyTorch

The hottest models in machine learning right now are **neural networks**. You've probably heard of them - they power your phone's assistant, Google's search, the marketing you see on Instagram, the driving behind a Tesla, and many other cutting-edge products.

<img src="data/images/tensorflow-vs-pytorch.png" width=300 height=200>

TensorFlow and PyTorch are competing Python packages that let anyone build their own neural network. While `scikit-learn` can build basic neural networks, TensorFlow and PyTorch let you customize them to the most minute detail, allowing you to truly get the highest performance model possible out of your data.

Both are very commonly used. TensorFlow has found more of a following in applications for private products, while PyTorch has seen more use in academic circles, but they are both capable in any field of use.

You can see a demo of TensorFlow [here](https://playground.tensorflow.org/).

### Git + GitHub

Once you get used to coding, you'll probably have built up a base of code from all of your cleaning scripts and model building. You could store your code on your local computer, but what if you want to share your code with a friend or colleague? What if they want to make some edits to your code for a different use case? How will you keep track of the different versions of the code amongst your teammates? You can see how the local storage option can get messy pretty quickly.

Thankfully, we have **Git**. Every programmer on any kind of project (data science related or not) must be familiar with Git. Git is a version control tool. Think of it as Dropbox or Google Drive on steroids. It allows different team members to work on a single code project simultaneously, and to merge their work eventually.

<img src="data/images/git_diagram.png">

Source: [What is git?](https://blog.cpanel.com/git-version-control-series-what-is-git/)

With Git, we can have a complete record of all previous states of our project, so that we can roll back to a previous version. We can also have several different versions or **branches** of our project at once and **merge** them eventually, which allows many different developers to work independently of each other and still manage to create a final product which is the sum of their efforts.

Git can be used without Internet access, but when you want to share your project with others, you will probably want to do it online. **GitHub** is a hosting provider that allows you to upload your project's Git history and to manage different versions online. These projects are known as **repositories**, or repos for short. A great many open-source projects share their source code as Git repos on GitHub, which allows anyone else to contribute.

## Data Visualization

As we work with our data, we must make sure to get to know it deeply. Seldom do we find datasets that offer all their valuable insights freely. By analyzing our information in detail, we can gather insights which allow us to act on the information, to better develop our data pipeline, to improve our models, and to emerge triumphant in our quest to answer our goal question.

To aid us in this endeavor, we have tools which allow us to visualize the data. These were some of the first pieces of software built for the field, back when **Business Intelligence (BI)** was all the rage. Excel is probably the best known (but nowadays also the most scorned). After Excel, many other great tools came out that made data analysis and visualization a breeze.

### Business Intelligence Software

<img src="data/images/logo_tableau_power_bi.jpg" width=800 height=500>

Source: [Power BI vs. Tableau](https://aptude.com/blog/entry/power-bi-vs-tableau-for-bi-data-visualization/)

The most widely used of these tools are designed for building reports out of our data, allowing us to visualize it and find relationships. To be blunt about their usefulness, everything is much easier to look at when code is replaced by pretty graphs. This applies not only to the data analyst but to everyone else in the company, since it is the rest of the team who must be able to understand the information through the analyst's work.

Working with these tools seldom requires coding experience, so they are also a very good way to get started in the world of data. Many senior data scientists today started out as data analysts using BI software. Two of the most famous of these pieces of software are Tableau and Power BI.

Both of them have similar features, allowing you to present your data in interactive reports that are easily explainable. Together with a competent data scientist, they can make explaining your data pipeline, insights, models, and results a breeze.

(As a footnote, Excel can sometimes work for very small projects. If we were in your shoes, though, we would strongly prefer to get out of the Excel mindset and start using tools such as these instead.)

Here are the public [Tableau](https://public.tableau.com/en-us/gallery/) and [Power BI](https://community.powerbi.com/t5/Data-Stories-Gallery/bd-p/DataStoriesGallery) galleries for you to see the power of each!

## Infrastructure

Many parts of this work, in particular data transformation and modeling, require a decent amount of computing power. This means that in order to keep your data pipeline flowing, you must acquire machines capable of running all the processes and machinations of your data team.

The set of computers needed to run your data pipeline is known as your **infrastructure**. Each computer in your infrastructure has its own specifications: its processor, the amount of RAM and disk space it has available, whether or not it has a GPU and what kind, among other details. The specifications you need will grow as you build more data-intensive and complex models. The field of work relating to managing the infrastructure for a data pipeline is known as **DataOps**.
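As a small illustration of what "specifications" means in practice, the following sketch asks the computer it runs on for a few of them using only Python's standard library. The function name `describe_machine` is our own, not part of any library. RAM and GPU details are left out because the standard library has no portable way to read them; third-party packages such as `psutil` can report memory.

```python
import os
import platform
import shutil


def describe_machine(path="."):
    """Collect a few basic specifications of the computer running this code."""
    disk = shutil.disk_usage(path)  # total, used, and free space in bytes
    return {
        "processor_architecture": platform.machine(),
        "cpu_cores": os.cpu_count(),
        "disk_total_gb": round(disk.total / 1e9, 1),
        "disk_free_gb": round(disk.free / 1e9, 1),
    }


specs = describe_machine()
for name, value in specs.items():
    print(f"{name}: {value}")
```

Running this on a laptop and on a rented cloud machine would typically show very different numbers, which is exactly the kind of difference a data team weighs when choosing where to run a workload.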

Now, your infrastructure can be local (commonly referred to as **on-prem**) or in the **cloud**. When your infrastructure is local, it means that you bought or rented computers and keep them physically close to you (e.g. you may have your own data center full of servers). This means that you must maintain them, keep them cool with AC, pick out the parts for each computer, and keep a team to make sure everything is running smoothly.

The alternative is having everything in the cloud. This means nothing more than renting a computer on someone else's premises and connecting to it over the Internet. Then, you run all your processes on that computer the same as if it were local, but you don't have to worry about maintaining the hardware yourself.

Local infrastructure is usually finicky and time-consuming to run, and it has a large upfront cost, which makes it infeasible for many projects. It is for this reason that most professionals nowadays rent their infrastructure and run their projects in the cloud, using any of several cloud providers.
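The effect of a large upfront cost can be made concrete with a little arithmetic. The sketch below computes how many months it takes before buying hardware costs no more in total than renting it. All prices are hypothetical, chosen only to make the calculation easy to follow, and `break_even_months` is a name we made up for this example. Real comparisons would also need to account for staff, electricity, and hardware replacement.

```python
import math


def break_even_months(upfront_cost, on_prem_monthly, cloud_monthly):
    """Months until on-prem total cost no longer exceeds cloud total cost.

    Returns None if renting is never more expensive per month than owning,
    in which case the upfront cost is never recovered.
    """
    monthly_saving = cloud_monthly - on_prem_monthly
    if monthly_saving <= 0:
        return None
    return math.ceil(upfront_cost / monthly_saving)


# Hypothetical prices, for illustration only.
months = break_even_months(
    upfront_cost=60_000,    # buying servers, racks, and cooling
    on_prem_monthly=1_000,  # power and maintenance once you own them
    cloud_monthly=3_500,    # renting equivalent machines
)
print(f"Owning pays for itself after {months} months")
```

With these made-up numbers, a project would need to run for two years before owning the hardware pays off, and the money has to be available on day one. A short or uncertain project rarely meets both conditions, which is one reason renting is so popular.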

### AWS

<img src="data/images/logo_aws.png">

Amazon Web Services, or AWS, is the most well-known of the current cloud providers. It offers computing power, storage, and many other specialized services so that you can think about your business instead of hardware.

AWS may be the most famous, but it's far from the only one. Other options include Google Cloud Platform and Microsoft Azure, among others.

## Conclusion

With this case, you've become acquainted with some of the many tools at the disposal of a data professional. In your day-to-day life working alongside data professionals, you will gain exposure to these tools first-hand, and maybe even one day use some of them yourself.

The most important thing to keep in mind is this: the heyday of Excel is long gone. In any high-performance team, data professionals nowadays have many advanced tools which they can access easily and for little cost other than a decent computer. So if you expect to gain value out of your data, make sure to keep up with the times and hire or foster skilled data teams that can handle the current tools of the trade.

In a future case, we will teach you all about the structure of data teams, the different roles, their functions, and specific skillsets so that you may create such a team from scratch or model your existing teams accordingly.