# Google search: python for data analysis 2nd edition pdf github

# Chapter 1  Preliminaries

## 1.1 What Is This Book About?

This book is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python.

### What Kinds of Data?

- Tabular or spreadsheet-like data in which each column may be a different type
(string, numeric, date, or otherwise). This includes most kinds of data commonly
stored in relational databases or tab- or comma-delimited text files.
- Multidimensional arrays (matrices).
- Multiple tables of data interrelated by key columns (what would be primary or
foreign keys for a SQL user).
- Evenly or unevenly spaced time series.
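
As a rough illustration of the third kind of data, two small tables related by a key column can be joined with pandas. The table names, column names, and values below are invented for this sketch and do not come from the book.

```python
import pandas as pd

# Hypothetical tables: customer_id acts like a primary key in `customers`
# and like a foreign key in `orders`.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ana", "Ben", "Cho"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 3],
    "amount": [25.0, 40.0, 15.5],
})

# Inner join on the shared key column, similar to a SQL JOIN.
joined = orders.merge(customers, on="customer_id", how="inner")
print(joined)
```

Customer 2 has no orders, so an inner join leaves that row out; `how="left"` or `how="outer"` would be the options to look at if unmatched rows need to be kept.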

## 1.2 Why Python for Data Analysis?

### Python as Glue
- Part of Python’s success in scientific computing is the ease of integrating C, C++, and FORTRAN code.  
- Many companies and national labs have used Python to glue together decades’
worth of legacy software.

### Solving the “Two-Language” Problem
- In many organizations, it is common to research, prototype, and test new ideas using
a more specialized computing language like SAS or R and then later port those ideas
to be part of a larger production system written in, say, Java, C#, or C++.   
- What people
are increasingly finding is that Python is a suitable language not only for doing
research and prototyping but also for building the production systems.   
- Why maintain
two development environments when one will suffice? 

### Why Not Python?
- In an application with very low latency or demanding
resource utilization requirements (e.g., a high-frequency trading system), the time
spent programming in a lower-level (but also lower-productivity) language like C++
to achieve the maximum possible performance might be time well spent. (As Python is an interpreted programming language, in general most Python code will
run substantially slower than code written in a compiled language like Java or C++.)
- Python can be a challenging language for building highly concurrent, multithreaded
applications, particularly applications with many CPU-bound threads.

## 1.3 Essential Python Libraries

### NumPy
NumPy, short for Numerical Python, has long been a cornerstone of numerical computing
in Python.   
It provides the data structures, algorithms, and library glue needed
for most scientific applications involving numerical data in Python.   
NumPy contains, among other things:
- A fast and efficient multidimensional array object, ndarray
- Functions for performing element-wise computations with arrays or mathematical
operations between arrays
- Tools for reading and writing array-based datasets to disk
- Linear algebra operations, Fourier transform, and random number generation
- A mature C API to enable Python extensions and native C or C++ code to access
NumPy’s data structures and computational facilities

Beyond the fast array-processing capabilities that NumPy adds to Python, one of its
primary uses in data analysis is as a container for data to be passed between algorithms
and libraries.
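
A minimal sketch of a few of the features listed above (the ndarray object, element-wise computation, and a linear algebra operation). The array values are arbitrary and chosen only for illustration.

```python
import numpy as np

# A 2x3 ndarray; all elements share a single dtype.
data = np.array([[1.5, -0.1, 3.0],
                 [0.0, -3.0, 6.5]])

# Element-wise computation: no explicit Python loop is needed.
scaled = data * 10

# Mathematical operation between two arrays of the same shape.
total = data + data

# A linear algebra operation: the 2x3 matrix times its 3x2 transpose.
gram = data @ data.T

print(data.shape, data.dtype)
print(scaled)
print(total)
print(gram)
```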

### pandas
pandas provides high-level data structures and functions designed to make working
with structured or tabular data fast, easy, and expressive.  
The primary objects in pandas that will be used in this book are the DataFrame,
a tabular, column-oriented data structure with both row and column labels, and the
Series, a one-dimensional labeled array object.  
Many features found in pandas are typically either part of the R
core implementation or provided by add-on packages.
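
The two objects described above can be sketched as follows. The labels and values are made up for the example.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# A DataFrame is tabular and column-oriented, with row labels (the index)
# and column labels.
df = pd.DataFrame(
    {
        "state": ["Ohio", "Ohio", "Nevada"],
        "year": [2000, 2001, 2001],
        "population": [1.5, 1.7, 2.4],
    },
    index=["r1", "r2", "r3"],
)

print(s["a"])                      # select a Series element by label
print(df.loc["r2", "population"])  # select a cell by row and column label
print(df.dtypes)                   # each column can have its own type
```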

### matplotlib
matplotlib is the most popular Python library for producing plots and other two-dimensional
data visualizations.

### IPython and Jupyter
The IPython project began in 2001 as Fernando Pérez’s side project to make a better
interactive Python interpreter.  

In 2014, the IPython team announced the Jupyter project, a broader
initiative to design language-agnostic interactive computing tools. The IPython web
notebook became the Jupyter notebook, with support now for over 40 programming
languages. The IPython system can now be used as a kernel (a programming language
mode) for using Python with Jupyter.

### SciPy
SciPy is a collection of packages addressing a number of different standard problem
domains in scientific computing. Here is a sampling of the packages included:

- scipy.integrate  
Numerical integration routines and differential equation solvers

- scipy.linalg  
Linear algebra routines and matrix decompositions extending beyond those provided
in numpy.linalg

- scipy.optimize  
Function optimizers (minimizers) and root finding algorithms

- scipy.signal  
Signal processing tools

- scipy.sparse  
Sparse matrices and sparse linear system solvers

- scipy.special  
Wrapper around SPECFUN, a Fortran library implementing many common
mathematical functions, such as the gamma function

- scipy.stats  
Standard continuous and discrete probability distributions (density functions,
samplers, cumulative distribution functions), various statistical tests, and more
descriptive statistics
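
A short sketch touching three of the subpackages listed above. The functions being integrated and minimized are arbitrary choices for illustration, not examples from the book.

```python
import math

from scipy import integrate, optimize, special

# scipy.integrate: numerically integrate sin(x) from 0 to pi (exact value: 2).
area, abs_err = integrate.quad(math.sin, 0.0, math.pi)

# scipy.optimize: minimize (x - 2)^2, whose minimum is at x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)

# scipy.special: the gamma function; gamma(5) equals 4! = 24.
g = special.gamma(5)

print(area, result.x, g)
```

`integrate.quad` returns both the estimate and an estimate of the absolute error, which is why it is unpacked into two names.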

### scikit-learn
Since the project’s inception in 2010, scikit-learn has become the premier general-purpose
machine learning toolkit for Python programmers.

### statsmodels
statsmodels is a statistical analysis package that was seeded by work from Stanford
University statistics professor Jonathan Taylor, who implemented a number of regression
analysis models popular in the R programming language.

## 1.4 Installation and Setup

### Windows
To get started on Windows, download the Anaconda installer.

### Installing or Updating Python Packages
At some point while reading, you may wish to install additional Python packages that
are not included in the Anaconda distribution.   

In general, these can be installed with
the following command:  
<p style="font-family: Courier New; font-size: 1.15em;">
conda install package_name
</p>

If this does not work, you may also be able to install the package using the pip package
management tool:  
<p style="font-family: Courier New; font-size: 1.15em;">
pip install package_name
</p>

You can update packages by using the conda update command:  
<p style="font-family: Courier New; font-size: 1.15em;">
conda update package_name
</p>

pip also supports upgrades using the --upgrade flag:  
<p style="font-family: Courier New; font-size: 1.15em;">
pip install --upgrade package_name
</p>


<img style="float: left;" src="pic/pic01.png">

- <span style="color:red">While you can use both conda and pip to install packages,   
you should not attempt to update conda packages with pip,   
as doing so can lead to environment problems.</span>   
- <span style="color:red">When using Anaconda or Miniconda,
it’s best to first try updating with conda.</span>

### Python 2 or Python 3

### Integrated Development Environments (IDEs)

- PyDev (free), an IDE built on the Eclipse platform
- PyCharm from JetBrains (subscription-based for commercial users, free for open
source developers)
- Python Tools for Visual Studio (for Windows users)
- Spyder (free), an IDE currently shipped with Anaconda

## 1.5 Community and Conferences

- pydata: A Google Group list for questions related to Python for data analysis and
pandas
- pystatsmodels: For statsmodels or pandas-related questions
- Mailing list for scikit-learn (scikit-learn@python.org) and machine learning in
Python, generally
- numpy-discussion: For NumPy-related questions
- scipy-user: For general SciPy or scientific Python questions

- PyCon and EuroPython: The two main general Python conferences in North
America and Europe, respectively
- SciPy and EuroSciPy: Scientific-computing-oriented conferences in North America
and Europe, respectively
- PyData: A worldwide series of regional conferences targeted at data science and
data analysis use cases
- International and regional PyCon conferences (see http://pycon.org for a complete
listing)

## 1.6 Navigating This Book

- **Interacting with the outside world**  
Reading and writing with a variety of file formats and data stores  

- **Preparation**  
Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and
transforming data for analysis  

- **Transformation**  
Applying mathematical and statistical operations to groups of datasets to derive
new datasets (e.g., aggregating a large table by group variables)
- **Modeling and computation**  
Connecting your data to statistical models, machine learning algorithms, or other
computational tools
- **Presentation**   
Creating interactive or static graphical visualizations or textual summaries
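
As a small sketch of the Transformation task above (aggregating a table by group variables), pandas `groupby` can be used as follows. The `sales` table and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical table of sales records.
sales = pd.DataFrame({
    "region": ["east", "west", "east", "west", "east"],
    "units": [10, 4, 6, 8, 2],
})

# Group rows by region, then derive a new table holding the sum and mean
# of units for each group.
by_region = sales.groupby("region")["units"].agg(["sum", "mean"])
print(by_region)
```

The result is a new, smaller DataFrame indexed by the group variable, with one row per region.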