# Configuring PyCUDA with Sample Runs
This notebook describes how I configured a Windows-based machine with Python and PyCUDA to invoke CUDA code from within Python, and provides some sample code. I used Anaconda to install the Python scientific packages; you can install the requisite packages however you see fit (for example, individually using conda).
## My System
OS: Windows 10 Home    
Processor: AMD FX-8150 8 Core 3.6 GHz  
RAM: 8GB    
Video: NVIDIA GeForce GTX 970 4GB 
## Steps for Installing [PyCUDA](http://mathema.tician.de/software/pycuda/)
1. Install [Visual Studio Community 2013](https://www.visualstudio.com/en-us/downloads/download-visual-studio-vs.aspx) 
2. Install [CUDA Toolkit 7.5](https://developer.nvidia.com/cuda-downloads)
3. Install [Anaconda](https://www.continuum.io/downloads) for Windows 64-bit, Python 2.7  
4. Open a DOS command prompt
    1. enter **conda update conda**
    2. enter **conda update anaconda**
5. Download the [PyCUDA install file](http://www.lfd.uci.edu/~gohlke/pythonlibs/#pycuda)
6. Open a DOS command prompt
    1. enter **pip install pycuda-2015.1.3+cuda7518-cp27-none-win_amd64.whl**
7. Install [Python Tools for VS 2013](https://github.com/Microsoft/PTVS/releases/v2.2)
8. Open a DOS command prompt
    1. enter **conda install boost**
9. Add the following to your PATH environment variable (e.g., Settings->System->About->System Info->Advanced System Settings->Environment Variables->PATH->Edit...):

>C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\;C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64;C:\Program Files (x86)\Microsoft Visual Studio 12.0\Common7\IDE
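
PyCUDA compiles kernels at run time by calling `nvcc`, which on Windows needs the Visual C++ compiler `cl.exe`, so it can help to confirm that both are reachable before running the samples. The following is a minimal sketch of such a check using only the standard library; `find_on_path` is a hypothetical helper name, and the sketch assumes that `nvcc` was added to PATH by the CUDA Toolkit installer.

```python
import os


def find_on_path(program, search_path=None):
    """Return the full path of `program` found in a PATH-style string, else None.

    `find_on_path` is a hypothetical helper written for this check; it is not
    part of PyCUDA or the CUDA Toolkit.
    """
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    # Try the bare name first, then the Windows executable suffix.
    candidates = [program, program + ".exe"]
    for directory in search_path.split(os.pathsep):
        directory = directory.strip('"')
        for name in candidates:
            full = os.path.join(directory, name)
            if os.path.isfile(full):
                return full
    return None


# Report whether the two compilers used by PyCUDA can be located.
for tool in ("cl", "nvcc"):
    location = find_on_path(tool)
    print("%s: %s" % (tool, location if location else "NOT FOUND on PATH"))
```

If either tool is reported as not found, open a new command prompt (so the updated PATH is picked up) and run the check again.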

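The first kernel in the cell below assigns one GPU thread to each array element: every thread computes a global index of the form `threadIdx.x + blockIdx.x * (threads per block)`, and the launch uses 1024 threads per block with 100000 blocks. The arithmetic behind those numbers can be checked on the CPU with the sketch below; `grid_size` and `global_index` are hypothetical helper names used only for illustration, and no GPU is required.

```python
# CPU-only sketch of the launch geometry; no GPU or PyCUDA required.
# grid_size and global_index are hypothetical helper names.


def grid_size(n_elements, threads_per_block):
    """Blocks needed so that every element gets a thread (ceiling division)."""
    return (n_elements + threads_per_block - 1) // threads_per_block


def global_index(thread_idx, block_idx, threads_per_block):
    """Element index handled by one thread, mirroring the kernel's arithmetic."""
    return thread_idx + block_idx * threads_per_block


n = 102400000   # array length used in the sample
block = 1024    # threads per block used in the sample

blocks = grid_size(n, block)
print(blocks)   # 100000

# The last thread of the last block should land on the last element.
last = global_index(block - 1, blocks - 1, block)
print(last == n - 1)
```

Because the array length here is an exact multiple of the block size, every thread maps to a valid element. For other lengths the ceiling division launches a few extra threads, and the kernel would need a bounds check (for example `if (i < n)`) before writing.
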
In [1]:
import pycuda.autoinit
import pycuda.driver as cuda
import numpy as np

from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
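  // One thread per element; blockDim.x is the number of threads per block (1024 in the launch below).
  const int i = threadIdx.x + blockIdx.x * blockDim.x;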
  dest[i] = a[i] * b[i];
}
""")

source_doublify = """
__global__ void doublify(float *a)
{
  int idx = threadIdx.x + threadIdx.y*4;
  a[idx] *= 2;
}
"""
start = cuda.Event()
end = cuda.Event()

multiply_them = mod.get_function("multiply_them")

a = np.random.randn(102400000).astype(np.float32)
b = np.random.randn(102400000).astype(np.float32)

dest = np.zeros_like(a)

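start.record() # start timing
# cuda.In/cuda.Out copy the arrays to and from the GPU around the launch,
# so the measured time includes the host-device transfers, not just the kernel.
multiply_them(cuda.Out(dest), cuda.In(a), cuda.In(b), block=(1024,1,1), grid=(100000,1))
end.record() # end timing
# wait for the end event, then calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3 # time_till returns milliseconds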
print "SourceModule time and first three results:"
print "%fs, %s" % (secs, str(dest[:3]))
                   
print "First few of a:"
print a[:5]
print "Last few of a:"
print a[-5:]
print "First few of b:"
print b[:5]
print "First few of local a*b calculations:"
print a[:5] * b[:5]
print "First few of GPU calculations:"
print dest[:5]

print "GPU result minus local multiplication a*b:"
result = dest-a*b
print len(result)
print result

mod = SourceModule(source_doublify)
func = mod.get_function("doublify")
# create a 4x4 matrix of single-precision floats
x = np.random.randn(4, 4)
x = x.astype(np.float32)
# allocate GPU memory and copy the matrix to the card
x_gpu = cuda.mem_alloc(x.nbytes)
cuda.memcpy_htod(x_gpu, x)
# launch the kernel with a single 4x4 block of threads
func(x_gpu, block=(4, 4, 1))
# copy the result back to the host
x_doubled = np.empty_like(x)
cuda.memcpy_dtoh(x_doubled, x_gpu)
print "x:"
print x
print "x doubled:"
print x_doubled

SourceModule time and first three results:
0.601702s, [-0.44203371  0.01866707 -0.12238965]
First few of a:
[-1.34425676 -0.02860783  0.90981776 -0.72514886 -1.88500714]
Last few of a:
[-0.37056407 -0.95410705 -0.13737066  0.42489156  0.34473094]
First few of b:
[ 0.32883132 -0.65251607 -0.13452107 -0.6292792  -0.99594241]
First few of local a*b calculations:
[-0.44203371  0.01866707 -0.12238965  0.45632109  1.87735856]
First few of GPU calculations:
[-0.44203371  0.01866707 -0.12238965  0.45632109  1.87735856]
GPU result minus local multiplication a*b:
102400000
[ 0.  0.  0. ...,  0.  0.  0.]
x:
[[ 0.0892044  -1.12173331 -0.46935287  1.99770534]
 [-0.4235895   0.67257589 -0.43842956  0.4212932 ]
 [ 0.50492972  0.61366057  0.37574279 -1.33577192]
 [ 0.92241836 -0.60946316  0.82436442  0.48008668]]
x doubled:
[[ 0.1784088  -2.24346662 -0.93870574  3.99541068]
 [-0.847179    1.34515178 -0.87685913  0.8425864 ]
 [ 1.00985944  1.22732115  0.75148559 -2.67154384]
 [ 1.84483671 -1.21892631  